GENERALIZABILITY THEORY: 1973-1980


Richard J. Shavelson
Noreen M. Webb


July 1981


P-6580

The Rand Corporation, Santa Monica, CA


PREFACE 


In this paper we review generalizability (G) theory, a theory of
the multifaceted errors of a behavioral measurement. The review was
undertaken at the request of Philip Levy, then editor of the Journal.
His idea was that the review would commemorate the first article on G
theory, which the Journal published in 1963 (Cronbach, Rajaratnam, &
Gleser, 1963). For these and personal reasons, we undertook the
review. The review does not cover the period 1963-1972 because that has
already been done by Cronbach, Gleser, Nanda, and Rajaratnam (1972).

In Section 1 we sketch out generalizability theory for those who
are not familiar with it. In doing so, we summarize the notation used
in the review. Section 2 reviews theoretical contributions. While it
primarily reflects what has been published, we take up some new topics
and identify others in need of treatment. Section 3 presents an
application of the theory in some detail. This application illustrates
basic concepts in the theory (Section 1) as well as recent theoretical
contributions (Section 2).

ABSTRACT

This paper reviews the developments in generalizability theory
from 1973 to 1980. The first section presents a sketch of
generalizability theory. The second section reviews theoretical
contributions about (1) problems associated with estimating variance
components, including sampling variability and negative estimates,
(2) fixed facets, (3) criterion-referenced measurement, (4) symmetry,
(5) multivariate generalizability, and (6) sampling in observational
measurement. The final section presents an illustrative application of
generalizability theory, including univariate and multivariate
generalizability analyses of balanced and unbalanced designs, and
Bayesian estimation of variance components.

1. SKETCH OF GENERALIZABILITY THEORY


Generalizability theory evolved out of the recognition that the
concept of undifferentiated error in classical test theory provided too
gross a characterization of the multiple sources of error in a
measurement. The multidimensional nature of measurement error can be
seen in how a test score is obtained. For example, one of many possible
test forms might be administered on one of many possible occasions by
one of many possible testers. Each of these choices (test form,
occasion, and tester) is a potential source of error. G theory assesses
each source of error in order to characterize the measurement and
improve its design.

A behavioral measurement, then, is a sample from a universe of
admissible observations, characterized by one or more facets (e.g.,
test forms, occasions, testers). This universe is usually defined by
the Cartesian product of the levels (called conditions in G theory) of
the facets. From this perspective, Cronbach et al. (1972, p. 15) say:


The score on which the decision is to be based is only one
of many scores that might serve the same purpose. The
decision maker is almost never interested in the response
given to the particular stimulus objects or questions, to
the particular tester, at the particular moment of testing.
Some, at least, of these conditions of measurement could be
altered without making the score any less acceptable to the
decision maker. That is to say, there is a universe of
observations, any of which would have yielded a usable basis
for the decision. The ideal datum on which to base the
decision would be something like the person's mean score
over all acceptable observations, which we shall call his
"universe score." The investigator uses the observed score
or some function of it as if it were the universe score.
That is, he generalizes from sample to universe. The
question of "reliability" thus resolves into a question of
accuracy of generalization, or generalizability.


Since different measurements may represent different universes,
G theory speaks of universe scores rather than true scores,
acknowledging that there are different universes to which
decisionmakers may generalize. Likewise, the theory speaks of
generalizability coefficients rather than the reliability coefficient,
realizing that the value of the coefficient may change as definitions
of universes change.

In G theory, a person's score is decomposed into a component for
the universe score ($\mu_p$) and one or more error components. To
illustrate this decomposition, we consider the simplest case for
pedagogical purposes: a one-facet, p x i (person by, say, test form)
design. (The object of measurement, here persons, is not a source of
error and, therefore, is not a facet.) The presentation readily
generalizes to more complex designs. In the p x i design with
generalization over all admissible test forms taken from an
indefinitely large universe, the score for a particular person (p) on a
particular form (i) is:


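[1]

$$
\begin{aligned}
X_{pi} = {} & \mu && \text{(grand mean)} \\
& + \mu_p - \mu && \text{(person effect)} \\
& + \mu_i - \mu && \text{(form effect)} \\
& + X_{pi} - \mu_p - \mu_i + \mu && \text{(residual)}
\end{aligned}
$$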


Except for the grand mean, each score component has a distribution.
Considering all persons in the population, there is a distribution of
$\mu_p - \mu$ with mean zero and variance
$\mathcal{E}(\mu_p - \mu)^2 = \sigma^2_p$, which is called the
universe-score variance and is analogous to the true-score variance
of classical theory. Similarly, the component for test form has mean
zero and variance $\mathcal{E}(\mu_i - \mu)^2 = \sigma^2_i$, which
indicates the variance of constant errors associated with test forms,
while the residual component has mean zero and variance
$\sigma^2_{pi,e}$, which indicates the person x form interaction
confounded with residual error, since there is one observation per
cell. The collection of observed scores, $X_{pi}$, has a variance
$\sigma^2_{X_{pi}} = \mathcal{E}(X_{pi} - \mu)^2$, which equals the sum
of the variance components:

[2]

$$
\sigma^2_{X_{pi}} = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}
$$

G theory focuses on these variance components. The relative
magnitudes of the components provide information about particular
sources of error influencing a measurement. It is convenient to
estimate variance components from an ANOVA of sample data. Numerical
estimates of the variance components are obtained by setting the
expected mean squares equal to the observed mean squares and solving
the set of simultaneous equations as shown in Table 1.

Table 1

ESTIMATES OF VARIANCE COMPONENTS FOR A
ONE FACET, p x i, DESIGN

Source of     Mean         Expected                              Estimated
Variation     Square       Mean Square                           Variance Component

Person (p)    $MS_p$       $\sigma^2_{pi,e} + n_i \sigma^2_p$    $\hat{\sigma}^2_p = (MS_p - MS_{res})/n_i$
Form (i)      $MS_i$       $\sigma^2_{pi,e} + n_p \sigma^2_i$    $\hat{\sigma}^2_i = (MS_i - MS_{res})/n_p$
p x i, e      $MS_{res}$   $\sigma^2_{pi,e}$                     $\hat{\sigma}^2_{pi,e} = MS_{res}$

$n_i$ = number of forms; $n_p$ = number of persons.
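
The computations summarized in Table 1 can be sketched in a few lines of Python. The sketch below is illustrative only: the function name `estimate_variance_components` and the small 3 x 2 score table are our own hypothetical choices, and the code assumes a fully crossed p x i design with exactly one observation per cell.

```python
def estimate_variance_components(scores):
    """Estimate variance components for a crossed p x i design.

    `scores` is a list of rows, one row per person and one column per
    form, with exactly one observation per cell (as in Table 1).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    form_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Sums of squares for persons, forms, and the residual (pi,e).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in form_means)
    ss_res = sum(
        (scores[p][i] - person_means[p] - form_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Equate observed and expected mean squares and solve (Table 1).
    return {
        "p": (ms_p - ms_res) / n_i,
        "i": (ms_i - ms_res) / n_p,
        "pi,e": ms_res,
    }


# Hypothetical data: three persons (rows) by two forms (columns).
scores = [[1, 3], [4, 4], [4, 8]]
components = estimate_variance_components(scores)
print(components)
```

For these made-up scores the mean squares are 8, 6, and 2, so the estimated components are 3 for persons, 4/3 for forms, and 2 for the residual.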


Variance components are estimated by means of a generalizability
(G) study. "The instrument developer, carrying out a G study to guide
users of his instrument will, in the design of that study, treat
systematically the facets that are likely to enter into generalizations
of various users" (Cronbach et al., 1972, p. 21). Ordinarily, the
universe of admissible observations is defined as broadly as possible
within practical and theoretical constraints. In most cases, Cronbach
et al. recommended using a crossed design so that all of the variance
components can be estimated. Cronbach et al. (1972) noted, however,
that a nested G study is sometimes useful because it provides more
degrees of freedom for some estimates of variance components.


G theory distinguishes a decision (D) study from a G study. This
distinction recognizes that certain studies are associated with the
development of a measurement procedure (G studies) while other studies
then apply the procedure (D studies). In planning the D study, the
decisionmaker (a) defines the universe of generalization and
(b) specifies his proposed interpretation of a measurement. These plans
determine (c) the questions to be asked of the G study data in order to
optimize the measurement design. Each of these points is considered
in turn.

(a) G theory recognizes that the universe of admissible
observations encompassed by a G study may be broader than the universe
to which a decisionmaker wishes to generalize. That is, the
decisionmaker proposes to generalize to a universe comprised of some
subset of facets in the G study. This universe is called the universe
of generalization. It may be defined by reducing the universe of
admissible observations, i.e., by reducing the levels of a facet
(creating a fixed facet; cf. fixed factor in ANOVA), by selecting and
thereby controlling one level of a facet, or by ignoring a facet. All
three alternatives have consequences for the estimation of the
components of error variance that enter into the observed score
variance.

(b) G theory recognizes that decisionmakers use the same test
score in different ways. For example, some interpretations may focus
on individual differences (i.e., relative or comparative decisions),
some may use the observed score as an estimate of a person's universe
score (absolute decisions; cf. criterion-referenced interpretations),
while still others may use the observed score in a regression estimate
of the universe score (cf. Kelley's, 1947, regression estimate of true
scores). There is a different error associated with each of these
proposed interpretations. For relative decisions, the error in a
p x i design is defined as:

[3]

$$
\delta_{pI} = (X_{pI} - \mu_I) - (\mu_p - \mu),
$$

where I indicates that an average has been taken over the levels of
facet i under which p was observed. The variance of the errors for
relative decisions is:

[4]

$$
\sigma^2_\delta = \sigma^2_{pI,e} = \sigma^2_{pi,e} / n'_i,
$$

where $n'_i$ indicates the number of conditions of facet i to be
sampled in a D study. Notice that (a) $\sigma^2_{pi,e}/n'_i$ is the
error variance (the squared standard error) of the mean of a person's
scores averaged over the levels of i (test forms in our example), and
(b) the magnitude of the error is under the control of the
decisionmaker in the D study. In order to reduce $\sigma^2_\delta$,
$n'_i$ may be increased. This is analogous to the Spearman-Brown
prophecy formula in classical theory and the standard error of the mean
in sampling theory.

For absolute decisions, the error is defined as:

[5]

$$
\Delta_{pI} = X_{pI} - \mu_p.
$$

The variance of these errors in a p x i design is:

[6]

$$
\sigma^2_\Delta = \sigma^2_I + \sigma^2_{pI,e}
               = \sigma^2_i / n'_i + \sigma^2_{pi,e} / n'_i.
$$

In contrast to $\sigma^2_\delta$, $\sigma^2_\Delta$ includes the
variance of constant errors associated with facet i ($\sigma^2_i$).
This arises because, in absolute decisions, the difficulty of the
particular test forms that a person receives will influence his
observed score and, hence, the decisionmaker's estimate of his universe
score. For relative decisions, however, the effect of test form is
constant for all persons and so does not influence the rank ordering
of them (see Erlich & Shavelson, 1976b).

Finally, for decisions based on the regression estimate of a
person's universe score, error (of estimate) is defined as:

[7]

$$
\varepsilon_p = \hat{\mu}_p - \mu_p,
$$

where $\hat{\mu}_p$ is the regression estimate of a person's universe
score. The estimation procedure for the variance of errors of estimate
may be found in Cronbach et al. (1972, p. 97ff).

(c) D studies encompass a wide variety of designs including
crossed, partially nested, and completely nested designs. All facets
in the D design may be random (cf. random model) or only some may be
random (cf. mixed model). Often, in D studies, nested designs are
used for convenience, for increasing sample size, or both. For example,
forms may be nested within persons (we write i:p to denote nesting).
The effect of the constant errors associated with facet i is then
confounded with the effect associated with the person by i-facet
interaction (pi,e). Hence,

[8]

$$
\mathcal{E}\sigma^2(X) = \sigma^2_p + \sigma^2_{I,pI,e}
                       = \sigma^2_p + \sigma^2_\Delta.
$$

Note that, for a completely nested design,
$\sigma^2_\delta = \sigma^2_\Delta$.

While stressing the importance of variance components and errors
such as $\sigma^2_\delta$, G theory also provides a coefficient
analogous to the reliability coefficient in classical theory. The
generalizability coefficient, $\mathcal{E}\rho^2$, is defined as the
ratio of the universe-score variance to the expected observed-score
variance, i.e., an intraclass correlation:

[9]

$$
\mathcal{E}\rho^2 = \frac{\sigma^2_p}{\mathcal{E}\sigma^2(X)}
                  = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}.
$$

The expected observed-score variance is used in G theory because the
theory assumes only random sampling of the levels of facets and so
the observed-score variance may change from one application of the
design to another. Sample estimates of the parameters in [9] are
used to estimate the G coefficient:

[9a]

$$
\mathcal{E}\hat{\rho}^2 =
  \frac{\hat{\sigma}^2_p}{\hat{\sigma}^2_p + \hat{\sigma}^2_\delta}.
$$

$\mathcal{E}\hat{\rho}^2$ is a biased but consistent estimator of
$\mathcal{E}\rho^2$.

For absolute decisions a generalizability coefficient can be
defined in an analogous manner:

[10]

$$
\mathcal{E}\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}
$$

and

[10a]

$$
\mathcal{E}\hat{\rho}^2 =
  \frac{\hat{\sigma}^2_p}{\hat{\sigma}^2_p + \hat{\sigma}^2_\Delta}.
$$

Finally, note that, for completely nested designs, regardless of
whether relative or absolute decisions are to be made, error variance
is defined as $\sigma^2_\Delta$ and so [10] provides the
generalizability coefficient for such designs.
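
To make the D-study reasoning concrete, the following Python sketch evaluates the relative and absolute error variances and the corresponding coefficients for a p x i design as the number of forms, $n'_i$, is varied. It is a sketch only: the function names and the variance components (4, 1, and 2) are hypothetical values chosen for illustration, not estimates from any study.

```python
def relative_error_variance(var_pi_e, n_i):
    """Error variance for relative decisions: sigma^2(pi,e) / n'_i."""
    return var_pi_e / n_i


def absolute_error_variance(var_i, var_pi_e, n_i):
    """Error variance for absolute decisions: adds sigma^2(i) / n'_i."""
    return var_i / n_i + var_pi_e / n_i


def g_coefficient(var_p, error_variance):
    """Universe-score variance over expected observed-score variance."""
    return var_p / (var_p + error_variance)


# Hypothetical variance components for persons, forms, and residual.
var_p, var_i, var_pi_e = 4.0, 1.0, 2.0

results = {}
for n_i in (1, 2, 4, 8):
    rel = relative_error_variance(var_pi_e, n_i)
    ab = absolute_error_variance(var_i, var_pi_e, n_i)
    results[n_i] = (g_coefficient(var_p, rel), g_coefficient(var_p, ab))
    print(n_i, round(results[n_i][0], 3), round(results[n_i][1], 3))
```

With these made-up components, two forms give a coefficient of 0.8 for relative decisions and about 0.727 for absolute decisions. The coefficient for absolute decisions is never larger, because the form variance enters only the absolute error.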

Generalizability theory has been applied widely in the behavioral
sciences. It has been applied, for example, in studying the
dependability of measures of the behavior of schizophrenic patients
(e.g., Mariotto & Farrell, 1979), assertion in the elderly (Edinberg,
Karoly, & Gleser, 1977), free-recall in children (Peng & Farr, 1976),
depth and duration of sleep (Coates, Rosekind, Strossen, Thoresen, &
Kirmil-Gray, 1979), behavior of teachers (e.g., Erlich & Shavelson,
1978), dentists' sensitivity toward patients (Gershen, 1976),
educational attainment (Cardinet, Tourneur, & Allal, 1976a), job
satisfaction using Spanish and English forms (Katerberg, Smith, & Hoy,
1977), student ratings of instruction (e.g., Gillmore, Kane, &
Naccarato, 1978), and heterosexual social anxiety (Farrell, Mariotto,
Conger, Curran, & Wallander, 1979).

A study of the dependability of measures of psychological
improvement of disaster survivors (Gleser, Green, & Winget, 1978)
illustrates the theory's treatment of multifaceted measurement error.
Twenty adult survivors (S) were interviewed independently by two
interviewers (I) in order to obtain data on the extent of psychiatric
impairment resulting from the disaster. Two raters (R) "quantified" the
interview data by rating each survivor on a number of subscales, such
as anxiety, of the Psychiatric Evaluation Form. In differentiating
survivors with respect to the extent of impairment, errors in the
measurement may arise from inconsistencies associated with
interviewers, raters, and other unidentified sources. G theory
incorporates these potential sources of error into a measurement model
and estimates the components of variance associated with each source of
variation in the 20 x 2 x 2 (S x I x R) design. Table 2 enumerates the
sources of variation and presents the estimated variance components for
the anxiety subscale. Three estimated variance components are large
relative to other components. The first, for survivors, is analogous to
true score variance in classical test theory and is expected to be
large. The second, the survivor by interviewer interaction, represents
one source of measurement error and is due to inconsistencies of the
two interviewers in obtaining information for different survivors. The
third is the residual term representing unidentified sources of
measurement error. The estimated generalizability coefficient,
$\mathcal{E}\hat{\rho}^2$, is 0.56 using two interviewers and two
raters.

Table 2

GENERALIZABILITY OF MEASURES OF PSYCHIATRIC
IMPAIRMENT OF DISASTER SURVIVORS

Source of                                   Estimated Variance
Variation                                   Component

Survivors (S)                               1.84
Raters (R)                                   .21
Interviewers (I)                             .49
SR                                           .27
SI                                          1.82
RI                                           .03
Residual (SRI,e)                            1.58

Generalizability Coefficient
($\mathcal{E}\hat{\rho}^2$)
for one rater and one interviewer           0.33

Generalizability Coefficient
($\mathcal{E}\hat{\rho}^2$)
for two raters and two interviewers         0.56
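
The two coefficients reported in Table 2 can be recovered from the tabled variance components. The Python sketch below assumes, on our part, that the coefficients refer to relative decisions with both facets random, so that the error variance consists of the SR, SI, and residual components, each divided by the number of conditions sampled. The function name `relative_g` is hypothetical.

```python
# Estimated variance components for the anxiety subscale (Table 2).
table2 = {
    "S": 1.84, "R": 0.21, "I": 0.49,
    "SR": 0.27, "SI": 1.82, "RI": 0.03, "SRI,e": 1.58,
}


def relative_g(vc, n_raters, n_interviewers):
    """G coefficient for relative decisions in the S x I x R design.

    Only interactions involving survivors enter the relative error;
    the R, I, and RI components do not affect survivors' rank order.
    """
    error = (
        vc["SR"] / n_raters
        + vc["SI"] / n_interviewers
        + vc["SRI,e"] / (n_raters * n_interviewers)
    )
    return vc["S"] / (vc["S"] + error)


g_one = relative_g(table2, 1, 1)
g_two = relative_g(table2, 2, 2)
print(round(g_one, 2), round(g_two, 2))  # prints 0.33 0.56
```

That the sketch reproduces 0.33 and 0.56 is consistent with the assumed error definition.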
2. THEORETICAL CONTRIBUTIONS


2.1. Estimated Variance Components: The Achilles Heel

Two major contributions of generalizability theory are its emphasis
on the multiple sources of measurement error and its de-emphasis of
the role played by summary reliability or generalizability coefficients.
Estimated variance components are the basis for indexing the relative
contribution of each source of error and the undependability of a
measurement. Yet Cronbach et al. (1972) warned that the estimates of
these variance components are unstable with usual sample sizes (cf.
Lindquist, 1953; Smith, 1978; van der Kamp, 1976).

While we consider the problems associated with estimating variance
components to be the Achilles heel of G theory, these problems afflict
all sampling theories. One virtue of G theory is that it brings
estimation problems to the fore and puts them up for examination.
Estimation problems, then, are the Achilles heel of all theories that
involve sampling.

With the importance and fallibility of estimated variance components
so clearly recognized, we find it astonishing that so little attention
has been given to this topic in the literature on G theory. Here we
review the few studies that have been done and also point out research
in the statistical literature which we hope will stimulate further work.

2.1.1.  Sampling  variability  of  estimated  variance  components 

Usually the estimate of a variance component ($\hat{\sigma}^2$) is
found using some linear combination of mean squares divided by a
constant. The sampling variance of an estimated variance component
($\hat{\sigma}^2$) is:

[11]

$$
\mathrm{VAR}(\hat{\sigma}^2)
  = \frac{1}{c^2} \sum_q \mathrm{VAR}(MS_q)
  = \frac{2}{c^2} \sum_q \frac{(\mathcal{E}MS_q)^2}{df_q},
$$

where c is the constant associated with the estimated variance
component, $\mathcal{E}MS_q$ is the expected value of the mean square,
$MS_q$, and $df_q$ is the degrees of freedom associated with $MS_q$.
(See Smith, 1978, for a lucid, terse development of this general
formula.) For example, in Table 1, the variance component for persons
is estimated by $(MS_p - MS_{res})/n_i$. With respect to [11], c refers
to $n_i$, and $MS_q$ refers to $MS_p$ and $MS_{res}$.

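The problem of fallible estimates can be illustrated by expressing
a mean square as a sum of population variances. In a two-facet, crossed
(p x i x j), random model design, the variance of the estimated variance
component for persons (that is, of the estimated universe-score
variance) is (Smith, 1978, Fig. 1):

[12]

$$
\begin{aligned}
\mathrm{VAR}(\hat{\sigma}^2_p) = \frac{2}{n_p - 1} \Bigg[
  & \left( \sigma^2_p + \frac{\sigma^2_{pi}}{n_i}
      + \frac{\sigma^2_{pj}}{n_j}
      + \frac{\sigma^2_{res}}{n_i n_j} \right)^2
    + \frac{1}{n_i - 1} \left( \frac{\sigma^2_{pi}}{n_i}
      + \frac{\sigma^2_{res}}{n_i n_j} \right)^2 \\
  & + \frac{1}{n_j - 1} \left( \frac{\sigma^2_{pj}}{n_j}
      + \frac{\sigma^2_{res}}{n_i n_j} \right)^2
    + \frac{1}{(n_i - 1)(n_j - 1)}
      \left( \frac{\sigma^2_{res}}{n_i n_j} \right)^2 \Bigg].
\end{aligned}
$$
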
With all of the components entering the variance of the estimated
universe-score variance, $\hat{\sigma}^2_p$, the fallibility of such an
estimate is quite apparent (if $n_i$ and $n_j$ are modest). In
contrast, the variance of the estimated residual variance
($\hat{\sigma}^2_{res}$) has only one variance component.

In a crossed design, then, the number of components and hence the
variance of the estimator increases from the highest order interaction
component to the main effect components. Consequently, sample estimates
of the universe-score variance, which are of crucial importance to the
dependability of a measurement, may reasonably be expected to be less
stable than estimates of components of error variance.
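
A short numerical sketch may help convey the size of this difference. It assumes the normal-theory result that the sampling variance of a mean square is twice its squared expectation divided by its degrees of freedom, together with the usual expected mean squares for a random p x i x j design. The function names and the population values (all components equal to 1, with 25 persons and 2 levels of each facet) are hypothetical.

```python
def sampling_variance(c, terms):
    """(2 / c**2) * sum over q of (E MS_q)**2 / df_q.

    `terms` holds one (expected_mean_square, degrees_of_freedom) pair
    for each mean square that enters the estimator.
    """
    return (2.0 / c ** 2) * sum(ems ** 2 / df for ems, df in terms)


def person_component_variance(vc, n_p, n_i, n_j):
    """Sampling variance of the estimated person component (p x i x j)."""
    ems_p = vc["res"] + n_j * vc["pi"] + n_i * vc["pj"] + n_i * n_j * vc["p"]
    ems_pi = vc["res"] + n_j * vc["pi"]
    ems_pj = vc["res"] + n_i * vc["pj"]
    ems_res = vc["res"]
    df_p = n_p - 1
    terms = [
        (ems_p, df_p),
        (ems_pi, df_p * (n_i - 1)),
        (ems_pj, df_p * (n_j - 1)),
        (ems_res, df_p * (n_i - 1) * (n_j - 1)),
    ]
    return sampling_variance(n_i * n_j, terms)


def residual_component_variance(vc, n_p, n_i, n_j):
    """Sampling variance of the estimated residual component."""
    df_res = (n_p - 1) * (n_i - 1) * (n_j - 1)
    return sampling_variance(1, [(vc["res"], df_res)])


# Hypothetical population values, chosen only for illustration.
vc = {"p": 1.0, "pi": 1.0, "pj": 1.0, "res": 1.0}
var_p_hat = person_component_variance(vc, n_p=25, n_i=2, n_j=2)
var_res_hat = residual_component_variance(vc, n_p=25, n_i=2, n_j=2)
print(round(var_p_hat, 4), round(var_res_hat, 4))
```

Under these assumed values the sampling variance of the person component is about 0.52, roughly six times that of the residual component (about 0.08), even though both population components equal 1.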

The sampling variability of estimated variance components leads
to a bandwidth-fidelity dilemma. Perhaps the greatest contribution of
G theory is its applicability to complex, realistic, multifaceted
measurement designs. Hence, G theory provides great bandwidth in
examining the dependability of behavioral measurements. However, for
complex multifaceted measurements the variability of the estimated
variance components is, in general, expected to increase. Hence,
fidelity in estimation is lower for multifaceted universes than for
single faceted universes. The bandwidth-fidelity dilemma is a dilemma
associated with all applied statistics, not just with G theory.


2.1.2.  Negative  estimates  of  variance  components 


Negative estimates of variance components can arise because of
sampling errors or because of model misspecification (Hill, 1970).
With respect to sampling error, the one-way ANOVA illustrates how
negative estimates can arise. The expected mean squares are:

$$
\mathcal{E}MS_W = \sigma^2_1 \quad \text{and} \quad
\mathcal{E}MS_B = \sigma^2_1 + n\sigma^2_2 = \sigma^2_{12},
$$

where $\mathcal{E}MS_W$ is the expected value of the mean square within
groups and $\mathcal{E}MS_B$ is the expected value of the mean square
between groups. Estimation of the variance components is accomplished
by equating the observed mean squares with their expected values and
solving the linear equations (see Brennan, 1978a, for algorithms). If
$MS_W$ is larger than $MS_B$, the estimate of $\sigma^2_2$ will be
negative.
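
The arithmetic behind a negative estimate is easy to see in a minimal Python sketch. The function name and the mean squares used here (3 between groups, 5 within groups, n = 4) are hypothetical numbers chosen so that the within-groups mean square exceeds the between-groups mean square.

```python
def one_way_components(ms_between, ms_within, n):
    """Solve the one-way expected-mean-square equations.

    Returns (within-groups estimate, between-groups estimate), where
    the between-groups estimate is (MS_B - MS_W) / n.
    """
    var_within = ms_within
    var_between = (ms_between - ms_within) / n
    return var_within, var_between


# Hypothetical mean squares with MS_W larger than MS_B.
var_within, var_between = one_way_components(3.0, 5.0, 4)
print(var_within, var_between)  # prints 5.0 -0.5
```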

Realizing this problem in G theory, Cronbach et al. (1972, p. 57)
suggested that "a plausible solution is to substitute zero for the
negative estimate, and carry this zero forward as the estimate of the
component when it enters any equation higher in the table of mean
squares." [See Davis (1974, pp. 15-17) for a summary of approaches to
treating negative estimates.] Notice that by setting negative estimates
of variance components to zero, the researcher is stating that a
reduced model provides an adequate representation of the data, thereby
admitting that the original model was misspecified. Scheffé (1959),
among others, has pointed out that while this is a reasonable solution
to the problem, the sampling distribution of the (once negative)
variance component as well as those variance components whose
calculation includes this component is more complicated and the
modified estimates are biased. The problem of handling negative
variance components is exacerbated in G studies with many crossed
facets as [12] suggests.

A Bayesian approach to estimating variance components appears to
be an attractive alternative to that of traditional sampling theory
because it: (a) provides a solution to the problem of negative point
and interval estimates of variance components; and (b) provides
interval estimates of variance components interpretable with respect to
the sample data, not to repeated sampling. While Bayesian methods
have rarely been applied to generalizability theory, the work of Box
and Tiao (1973; see also Davis, 1974; Fyans, 1977; Novick & Jackson,
1974; Novick, Jackson, & Thayer, 1971) provides an important starting
point.

Hill (1965, 1967, 1970; see also Novick et al., 1971) pointed out
that sampling theory's usual unbiased estimate of $\sigma^2_1$ ignores
relevant information contained in
$\sigma^2_{12} = \sigma^2_1 + n\sigma^2_2$. For Bayesians, a negative
estimate of the between persons component indicates that $MS_W$ and
$MS_B$ are providing conflicting information. A Bayesian approach,
then, incorporates information about $\sigma^2_1$ that is contained in
both $MS_W$ and $MS_B$. The approach also includes the constraint that
$\mathcal{E}MS_B \geq \mathcal{E}MS_W$, which is equivalent to
$\sigma^2_2 \geq 0$ (Box & Tiao, 1973, see p. 254).

Fyans (1977) provided a general strategy for obtaining Bayesian
estimates of the modes of the posterior distributions for variance
components from crossed, partially nested, and completely nested
designs. Following Box and Tiao's (1973) formulation, he assumed a
locally uniform prior with $p(\sigma) \propto \sigma^{-1}$ and
constrained the variance components to be greater than or equal to
zero. Under normality and independence, the joint mode (V) for the
posterior distribution of any source of variation in a design is given
by its sum-of-squares divided by the appropriate degrees of freedom
plus two (Fyans, 1977). In a p x i design, for example, the joint modes
would be:

[15]

$$
V_p = SS_p / (df_p + 2), \qquad
V_i = SS_i / (df_i + 2), \qquad \text{and} \qquad
V_{pi} = SS_{pi} / (df_{pi} + 2).
$$

By equating the V (i.e., the adjusted mean squares) to their
expectations, Bayesian modal estimates of the variance components
($\hat{\sigma}^2$) can be obtained:

[16]

$$
\hat{\sigma}^2_p = (V_p - V_{pi}) / n_i, \qquad
\hat{\sigma}^2_i = (V_i - V_{pi}) / n_p, \qquad
\hat{\sigma}^2_{pi} = V_{pi}.
$$

Box and Tiao (1973) provided posterior distributions for variance
components in crossed and completely nested designs. Fyans (1977)
provided a posterior distribution for variance components for a
partially nested design. With estimates in hand, a Bayesian
generalizability coefficient can be obtained in the usual manner.

If the posterior modal values do not satisfy the constraint of
nonnegative estimated variance components, the Bayesian approach sets
all joint modes equal to each other and uses a pooled estimate for the
common value (Fyans, 1977, p. 151).
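
The modal estimates for the p x i design can be sketched as follows. This is a minimal illustration of the two steps described above (divide each sum of squares by its degrees of freedom plus two, then solve the expected-mean-square equations). The function name and the sums of squares (16, 6, and 4 for three persons and two forms) are hypothetical, and the sketch does not implement the pooling step used when a modal estimate turns out negative.

```python
def bayesian_modal_components(ss_p, ss_i, ss_pi, n_p, n_i):
    """Bayesian modal estimates of variance components, p x i design."""
    df_p = n_p - 1
    df_i = n_i - 1
    df_pi = df_p * df_i

    # Joint modes: sums of squares over (degrees of freedom + 2).
    v_p = ss_p / (df_p + 2)
    v_i = ss_i / (df_i + 2)
    v_pi = ss_pi / (df_pi + 2)

    # Equate the joint modes to the expected mean squares and solve.
    return {
        "p": (v_p - v_pi) / n_i,
        "i": (v_i - v_pi) / n_p,
        "pi,e": v_pi,
    }


# Hypothetical sums of squares for 3 persons by 2 forms.
modal = bayesian_modal_components(ss_p=16.0, ss_i=6.0, ss_pi=4.0, n_p=3, n_i=2)
print(modal)
```

For these made-up sums of squares the joint modes are 4, 2, and 1, giving modal estimates of 1.5, 1/3, and 1. The corresponding mean squares would be 8, 6, and 2, so the adjustment is substantial when degrees of freedom are this small.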

Finally, the interpretation of the Bayesian interval is more
straightforward than is the interval obtained by sampling theory. In
sampling theory, the probability statement refers to all possible
confidence intervals rather than to a particular interval (the one in
hand) that was constructed from sample data. In contrast, the
probability statement associated with the Bayesian interval refers
directly to likely values of the population variance component and not
to replications of the design. In practical applications of any
measurement theory, we make decisions on the basis of the sample data.
Hence, the Bayesian interpretation corresponds to how variance
components are used in practical applications of G theory.

Cronbach et al.'s (1972) evaluation of the potential contribution
of Bayesian G theory has held up over time:

Bayesian statistical inference needs to be exploited
systematically. It appears likely that developments
now available in the statistical literature could, in
some problems, profitably replace the methods of
estimating variance components ... [and universe scores].
Also, whereas we obtain all estimates from the G study,
one could, by Bayesian methods, take into account the
additional information offered by the D study to reach
final conclusions about the generalizability of the D
data [p. 336, italics ours].

Novick  (1976)  went  even  further  than  Cronbach  et  al.,  by  pointing  out 
that  if  one  accepts  de  Finetti's  exchangeability  concept  (see  Section 
2.2),  "Generalizability  Theory... is  Bayesian  in  everything  but  a  formal 
sense,  though  I  do  not  think  the  authors  see  it  that  way"  (p.  24).  In 
our  opinion,  it  is  time  that  the  formal  sense  in  which  G  theory  is 
Bayesian  be  systematically  explored. 

2.1.3.  Allocation  of  observations  to  reduce  the  sampling  variability 
of  estimated  variance  components 

Woodward and Joe (1973) and Smith (1978) addressed the problem of
how to allocate measurements (e.g., $n_i$ and $n_j$) to maximize the
generalizability coefficient (Woodward & Joe) or to produce the most
stable estimates of variance components (Smith). They arrived at
similar recommendations. In a p x i x j design, for example, as
$\sigma^2_{res}$ increases relative to $\sigma^2_{pi}$ and
$\sigma^2_{pj}$, the optimal solution tends toward $n_i = n_j$. As
$\sigma^2_{res}$ decreases relative to $\sigma^2_{pi}$ and
$\sigma^2_{pj}$, the optimal solution is to make $n_i / n_j$
proportional to $\sigma^2_{pi} / \sigma^2_{pj}$.

2.1.4.  Monte  Carlo  studies  of  variance  components 

Smith conducted Monte Carlo studies with: (a) three designs,
Design A = p x i x j, Design B = p x (j:i), and Design C = (j:p) x i;
(b) $n_p$ = 25, 50, 100; (c) $n_i$ and $n_j$ = 2, 4, 8;
(d) $\sigma^2_i : \sigma^2_j$ = 1:4, 4:1; and (e) $\sigma^2$ = 20, 76.
A random effects, p x i x j ANOVA model (assuming additivity and
independence) was used to generate normally distributed, integer
scores. For each case, 300-500 replications were obtained.

Smith concluded that the estimated variance components from
multifaceted generalizability studies contain sizable error. More
specifically, he found that: (a) The sampling errors of variance
components are much greater for multifaceted universes than for single
faceted universes. (b) For $\sigma^2_p$, the sampling errors were large
unless the total number of observations ($n_p n_i n_j$) was at least
800. (c) Stable estimates of $\sigma^2_i$ and $\sigma^2_j$ required at
least eight levels of each facet. And (d) some nested designs produced
more stable estimates than did crossed designs.

Similar simulations conducted by Calkins, Erlich, Marston, and
Malitz (1978), Leone and Nelson (1966), and Smith (1980) reached
similar conclusions. Calkins et al. (1978) and Leone and Nelson (1966)
also found many negative estimates of variance components, especially
when the number of levels of a facet was small (e.g., five).
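
The kind of simulation described above can be sketched in a few lines. This is a deliberately simplified illustration, not a reconstruction of any cited study: it uses a one-facet crossed p x i design rather than the two-facet designs Smith examined, and every parameter value (25 persons, the three variance components, the number of replications) is a hypothetical choice. It counts how often the ANOVA estimate of the item component falls below zero.

```python
import random


def simulate_scores(n_p, n_i, var_p, var_i, var_res, rng):
    """Draw X_pi = person effect + item effect + residual (grand mean 0)."""
    persons = [rng.gauss(0.0, var_p ** 0.5) for _ in range(n_p)]
    items = [rng.gauss(0.0, var_i ** 0.5) for _ in range(n_i)]
    return [[p + i + rng.gauss(0.0, var_res ** 0.5) for i in items]
            for p in persons]


def anova_components(x):
    """ANOVA estimates (var_p, var_i, var_res) for a crossed p x i table."""
    n_p, n_i = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in x]
    i_means = [sum(row[k] for row in x) / n_p for k in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((v - grand) ** 2 for row in x for v in row)
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = (ss_tot - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    # Set each mean square equal to its expectation and solve.
    return (ms_p - ms_res) / n_i, (ms_i - ms_res) / n_p, ms_res


def negative_rate(n_i, n_p=25, var_p=4.0, var_i=0.5, var_res=20.0,
                  reps=500, seed=0):
    """Share of replications whose item-component estimate is negative."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(reps):
        x = simulate_scores(n_p, n_i, var_p, var_i, var_res, rng)
        if anova_components(x)[1] < 0:
            negatives += 1
    return negatives / reps


for levels in (2, 4, 8):
    print(levels, negative_rate(levels))
```

Under these assumed values, negative estimates of the item component become less frequent as the number of item levels grows, which is the pattern the simulation studies report.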

Recognizing these problems with estimated variance components, Smith
(1978, 1980) proposed the use of several small G studies with many
levels of one or a few facets instead of one large, crossed G study.
One small G study might estimate $\sigma^2_p$, $\sigma^2_i$, and
$\sigma^2_{pi}$, while another might estimate $\sigma^2_p$,
$\sigma^2_j$, and $\sigma^2_{pj}$, such that the total number of
observations would be equal to that of the one large, crossed G study.
We question, however, whether
a  universe  of  admissible  observations  represented  in  a  series  of  G  stud¬ 
ies  with  more  restricted  universes  is  equivalent  to  the  universe  repre¬ 
sented  by  one,  larger  universe.  We  also  question  whether  the  construc¬ 
tion  of  a  large  G  study  from  a  series  of  smaller  G  studies  would  provide 
information  appropriate  for  the  decisionmaker's  universe  of  generali¬ 
zation  in  a  D  study. 


2.1.5.  Unbalanced  Designs 

An  unbalanced  design  has  unequal  numbers  of  observations  in  its 
subclassifications.  Two  examples  of  unbalanced  G  study  designs  are 
(1)  pupils  nested  in  classes  where  class  size  is  not  constant,  and  (2) 
observers  nested  in  occasions  where  unequal  numbers  of  observers  are 
present  at  each  occasion. 

In a comprehensive review, Searle (1971, p. 35) pointed out that
the usual ANOVA approach to estimating variance components with balanced
data (setting mean squares equal to their expectation) is not as
straightforward when applied to unbalanced data. This section shows how
the usual ANOVA approach to estimating variance components in balanced
designs can be applied to unbalanced designs and points to an
alternative approach using the computer (cf. Llabre, 1978, 1980;
Brennan, Jarjoura & Deaton, 1980).

Several  aspects  of  the  ANOVA  approach  differ  in  unbalanced  designs 
and  are  problematic.  First,  the  sums  of  squares  are  not  additive.  The  mean 
squares,  therefore,  may  be  unadjusted  or  adjusted  for  one  or  more  effects.  And 
the  choice  of  adjustment  is  not  always  clear  (see  Searle,  1971).  Second, 
solutions  to  the  problem  of  non-additive  sums  of  squares  lead  to  biased 
estimation  in  mixed  models.  And,  third,  the  simple  rules  for  deriving 
expected  values  of  mean  squares  (Cornfield  &  Tukey,  1956)  do  not  apply 
to  unbalanced  designs.  The  coefficients  in  the  expected  mean  square 
equations  are  algebraically  and  computationally  complex. 
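
Even the simplest unbalanced case shows how the coefficients change. The sketch below handles only the one-way design of example (1), pupils nested in classes of unequal size, using the standard ANOVA result that the expected between-class mean square is $\sigma^2_{within} + n_0 \sigma^2_{class}$ with $n_0 = (N - \sum n_c^2 / N)/(a - 1)$, where $a$ is the number of classes and $N$ the total number of pupils. It is an illustration of why a single common class size no longer appears in the equations. It is not MIVQUE and does not extend to multifaceted designs. The function name and the data are hypothetical.

```python
def unbalanced_one_way(groups):
    """ANOVA variance components for pupils nested in unequal classes.

    groups: list of lists, one inner list of pupil scores per class.
    Returns (var_class, var_within, n0).
    """
    a = len(groups)
    sizes = [len(g) for g in groups]
    total_n = sum(sizes)
    grand = sum(sum(g) for g in groups) / total_n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (total_n - a)
    # n0 replaces the common class size of the balanced case.
    n0 = (total_n - sum(n * n for n in sizes) / total_n) / (a - 1)
    return (ms_between - ms_within) / n0, ms_within, n0


# Two hypothetical classes with 2 and 4 pupils.
print(unbalanced_one_way([[1, 3], [5, 5, 7, 7]]))
```

When every class has the same size, $n_0$ reduces to that common size and the balanced-data equations are recovered.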

Henderson  (1953)  proposed  variations  in  the  analysis  of  variance 
approach  to  deal  with  the  first  problem.  The  problem  of  biased  esti¬ 
mation  in  mixed  models  is  not  a  problem  in  G  theory,  since  G  theory  is 
essentially  a  random  effects  theory.  That  is,  G  theory  averages  over 
fixed  facets  in  a  mixed  model  and  so  estimates  only  variances  of  random 
effects. Finally, computational complexity is reduced by using Rao's
(1971, 1972) approach called minimum variance quadratic unbiased
estimation (MIVQUE; see Llabre, 1978, 1980; Brennan, Jarjoura & Deaton,
1980). Incidentally, MIVQUE also avoids the problem of the order of
the components.

MIVQUE is available for small designs in the VARCOMP procedure of
the SAS (Statistical Analysis System, 1979) computer system. Brennan,
Jarjoura and Deaton (1980) reviewed this and other computer programs
for estimating variance components in unbalanced designs. They
mentioned the limited storage capacities of many of the programs, which
restrict their use to small designs. The major problem remaining in
the estimation of variance components with unbalanced data, then, is
to develop efficient computer programs that will estimate variance
components in large generalizability designs without requiring
prohibitively large storage capacities.

2.2.  Fixed  Facets 

In  G  theory  a  fixed  facet  has  a  fixed  set  of  conditions  that  appear 
in  the  G  and  D  study.  Although  this  definition  parallels  that  given  for 
a  fixed  factor  in  sampling  theory,  it  also  describes  facets  which  are 
often  considered  in  practice  to  be  random.  Loevinger  (1965)  goes  so  far 
as  to  argue  that  all  facets  must  be  considered  fixed  in  any  measurement 
theory.  This  issue  is  taken  up  below. 

Statistically, G theory treats fixed facets by averaging (or
summing) over the conditions of the fixed facet and examining the
generalizability of these averages over the random facets (Cronbach
et al., 1972, p. 60; see Erlich & Shavelson, 1976b, for a proof). While
this treatment of fixed facets is justifiable on sampling theory
grounds, it does not always lead to a sensible treatment of fixed
facets in the measurement theory. This issue is also considered below.

2.2.1.  Fixed  versus  random  facets 

Often  a  test  developer  has  a  fixed  set  of  items  which  he  considers 
to  be  a  random  sample  of  items  from  some  more  or  less  well  defined  uni¬ 
verse.  Can  this  set  of  items  legitimately  be  considered  a  random  sam¬ 
ple  or,  as  Loevinger  (1965)  insists,  should  it  be  considered  a  fixed 
facet? 


The  objection  to  all  psychometric  developments  that  assume 
random  sampling  of  items  or  tests  is  in  the  first  instance 
that  they  grossly  misrepresent  the  actual  case,  which  is 
almost  invariably  expert  selection  rather  than  random  sam¬ 
pling.  But  there  is  a  subtler  and  deeper  point.  The  term 
population  implies  that,  in  principle  one  can  catalog,  or 
display,  or  index  all  possible  members,  even  though  the 
population  and  the  catalogue  [sic]  cannot  be  completed. 
Statistical  sampling  must  be  tied  to  such  a  display  and 
indexing  system,  else  it  cannot  be  random  [Loevinger,  1965, 
p.  147]. 


One possible way to resolve this problem is suggested by de
Finetti's (1964) concept of exchangeability (cf. Kingman, 1978;
Lindley & Novick, 1979; Novick, 1976; see also Davis, 1974; Fyans,
1977). Put (perhaps too) simply, this concept states that even though
conditions of a facet have not been sampled randomly, the facet may be
considered to be random if conditions not observed in the G study can
be exchanged with the observed conditions. Formally,

The random variables $X_1, \ldots, X_n$ are exchangeable if the $n!$
permutations $(X_{k_1}, \ldots, X_{k_n})$ have the same n-dimensional
probability distribution. The variables of an infinite
sequence $\{X_n\}$ are exchangeable if $X_1, \ldots, X_n$ are
exchangeable for each n [Feller, 1966, p. 225].

Although  not  explicitly  stated,  the  concept  of  exchangeability  is 
evident  in  discussions  of  G  theory  applications.  For  example,  Elffers 
and  Tavecchio  (1979,  p.  5)  argued  that  as  long  as  the  conditions  of  the 
facets  in  the  G  study  are  not  very  different  from  the  larger  set,  the 
facet  can  be  considered  random. 

Viewed  from  the  exchangeability  perspective,  the  issue  of  fixed  or 
random  facets  is  not  whether  one  can  catalog  (etc.)  all  possible  mem¬ 
bers  of  a  population,  but  whether  the  members  in  hand  are  exchangeable 
with other potential members. In terms of item sampling, if the set
of persons and items to which $\rho^2$ is generalizable is a set of such
persons and items jointly exchangeable with the present sample, it is
reasonable to consider the item facet to be random.

The  concept  of  exchangeability,  at  a  minimum,  provides  reasonable 
grounds  for  considering  whether  a  facet  is  random  or  fixed.  At  best, 
it  suggests  that  random  facets  abound. 

2.2.2.  Statistical  treatment  of  fixed  facets 

G  theory  treats  a  fixed  facet  in  one  of  several  different  ways. 
Perhaps  the  most  frequent  approach  is  to  average  scores  over  the  con¬ 
ditions  of  the  fixed  facet  and  examine  the  generalizability  of  this 
average  over  the  random  facets.  For  example,  general  aptitude  batteries 
like the ACT Assessment are designed to predict future academic
achievement. Clearly, the ACT subtests are fixed, so scores would be
averaged over subtests and the generalizability of this average examined
over the random facets of the design. In this case, averaging scores over
subject matters makes good sense from the standpoint of prediction of
academic success. Notice that by averaging over the conditions of a
fixed facet, G theory is essentially a random model theory.
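
A minimal sketch of what averaging over a fixed facet does to the coefficient follows. It assumes a p x i x s design in which items (i) are random and subtests (s) are fixed, and it uses the common formulation in which the person-by-subtest component, divided by the number of subtests, joins the universe-score variance instead of the error. The function names and the variance components are hypothetical, and the random-facet version is included only for comparison.

```python
def g_fixed_s(var_p, var_pi, var_ps, var_pis, n_i, n_s):
    """Relative-decision G coefficient with facet s fixed (averaged over)."""
    universe = var_p + var_ps / n_s           # ps term counts as universe score
    rel_error = (var_pi + var_pis / n_s) / n_i
    return universe / (universe + rel_error)


def g_random_s(var_p, var_pi, var_ps, var_pis, n_i, n_s):
    """Same design, but with facet s treated as random."""
    rel_error = var_pi / n_i + var_ps / n_s + var_pis / (n_i * n_s)
    return var_p / (var_p + rel_error)


# Hypothetical components for persons, pi, ps, and the residual.
components = dict(var_p=4.0, var_pi=2.0, var_ps=1.0, var_pis=3.0)
print(g_fixed_s(n_i=5, n_s=2, **components))
print(g_random_s(n_i=5, n_s=2, **components))
```

The two functions share the same denominator (the expected observed-score variance); only the numerator changes, so the coefficient is never smaller when the facet is treated as fixed.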

A  second  approach  is  to  examine  the  generalizability  of  scores  at 
each  condition  of  the  fixed  facet.  For  example,  the  generalizability 
of  scores  on  the  ACT  Assessment  would  be  examined  separately  for  each 
subtest.  Often,  in  treating  each  condition  of  a  fixed  facet  separately 
the  decisionmaker  is  willing  to  consider  the  conditions  of  the  facet 
as  a  profile  of  scores.  Hence,  the  analysis  focuses  on  estimating  each 
subject's  universe  score  on  each  condition  of  the  fixed  facet  (Cronbach 
et  al.,  1972).  For  example,  the  ACT  subtests  might  be  considered  a  pro¬ 
file  and  estimates  of  students'  universe  scores  would  be  obtained  for 
each  subtest.  (For  a  discussion  of  this  approach,  see  Section  2.5.) 

The  third  approach  is  to  choose  one  of  the  first  two  approaches 
on  the  basis  of  the  estimated  variance  of  the  conditions  of  a  fixed 
facet.  In  other  words,  in  the  absence  of  strong  a  priori  reasons  for 
averaging  over  the  conditions  of  a  fixed  facet  or  treating  the  condi¬ 
tions  as  a  profile,  the  decisionmaker  examines  the  variability  of  the 
conditions  of  a  fixed  facet.  If  the  variability  is  minimal,  the  scores 
may  be  averaged  over  the  conditions  of  the  fixed  facet.  If  the  vari¬ 
ability  is  substantial,  the  decisionmaker  may  choose  to  treat  each  con¬ 
dition  separately  or  consider  the  conditions  as  a  profile. 

The  decision  about  how  to  treat  a  fixed  facet  in  a  G  study  is  not 
necessarily  straightforward,  as  illustrated  by  a  study  of  measures  of 
teacher  behavior  (Erlich  &  Shavelson,  1978).  Teachers  were  observed 
on  three  occasions  by  two  raters  while  teaching  reading  and  math. 

Subject  matter  was  treated  as  fixed,  and  teachers'  scores  were  averaged 
over  reading  and  math  lessons.  Teaching  behavior,  however,  is  quite 
different  during  reading  and  math.  Averaging  over  those  two  subject 
matters  may  have  distorted  the  phenomena  being  observed  as  well  as  the 
estimated  universe  score  variance.  A  preferable  strategy  in  this  study 
might  have  been  to  examine  the  generalizability  of  the  reading  and  math 
data separately or as a profile. However, an elementary school principal
might be interested in a teacher's behavior in general and so be quite
willing to use an average over reading, math, and other subjects. For
the principal's purpose, Erlich and Shavelson's (1978) treatment of the
subject matter facet may have been appropriate.

Since  the  decisionmaker  determines  whether  a  facet  is  fixed  or 
random,  the  most  reasonable  solution  may  be  to  report,  in  a  G  study, 
the  variance  components  and  generalizability  coefficients  for  each  con¬ 
dition  of  a  fixed  facet  separately  (usually  there  are  only  a  few  condi¬ 
tions  of  a  fixed  facet)  and  in  combination  (averaged  over  the  conditions). 
By  doing  so,  the  decisionmaker  can  choose  which  data  are  most  pertinent 
to  a  proposed  D  study. 

2.3.  Criterion-referenced  Measurement 

The term criterion-referenced measurement (CRM) has been defined
in a variety of ways.² Most of these definitions include a
well-specified content domain (Hambleton & Novick, 1973; Popham, 1975).

Following  Cronbach  et  al.  (1972)  and  Brennan  (1980),  we  speak  of 
interpretations  of  test  scores  as  being  criterion,  or  content,  or  domain 
referenced.  The  observed  score  is  interpreted  as  being  representative 
of  the  universe  of  content  from  which  it  is  sampled. 

Since  criterion-referenced  interpretations  consider  an  individual's 
status  independent  of  other  persons  (cf.  absolute  decisions  in  Cronbach 
et  al.,  1972,  p.  14),  "The  primary  question  is:  how  large  is  the  error 
arising  from  incomplete  observation?"  (Cronbach  et  al.,  1972,  p.  23). 

The errors, $\Delta$ and $\varepsilon$, and the G coefficient for
absolute decisions speak directly to criterion-referenced
interpretations (cf. Brennan, 1980; Brennan & Kane, 1977a,b; Hambleton,
Swaminathan, Algina & Coulson, 1978; Kane & Brennan, 1980; Linn, 1979;
Shavelson, 1979).

Cronbach  et  al.  (1972)  discussed  three  approaches  to  estimating  uni¬ 
verse  scores:  regression,  interval,  and  Bayesian  estimates.  Generaliza¬ 
bility  theory  greatly  complicates  regression  estimates  "when  it  recognizes 
that  conditions  may  not  be  equivalent  and  considers  any  set  of  conditions 
to  be  a  sample  from  a  universe"  (Cronbach  et  al.,  1972,  p.  140).  Moreover, 
with  interval  estimates,  the  probability  statements  about  nonrandomly 
selected individuals are not justifiable. Finally, Cronbach et al. (1972)
suggested that estimation problems might be solved with a Bayesian
approach (see Section 2.1.2) if this approach could be extended to
complex designs.

In  reviewing  the  literature,  we  were  surprised  to  find  that  the  call 
for  systematic  work  on  Bayesian  estimates  of  universe  scores  in  G  theory 
has  not  been  answered.  Fyans  (1977)  provided  a  good  summary  of  methods 
proposed  by  Novick  (1969;  Novick  &  Hall,  1965),  Box  and  Tiao  (1968), 
and Lindley (1971). The work of Novick (Novick, Jackson, Thayer & Cole,
1972; Novick et al., 1971; Novick, Lewis & Jackson, 1973; Novick,
Jackson & Thayer, 1971; Lewis, Wang & Novick, 1975; see Molenaar & Lewis,
1979, for a computer program) and others on m-group regression and the
work  of  Wilcox  (1978)  with  empirical  Bayes  estimation  procedures  for 
true  scores  in  the  compound  binomial  error  model  seem  to  be  a  logical 
starting  place  for  Bayesian  estimation  of  universe  scores. 

While G theory focuses on estimation, most of the CRM literature
has focused on generalizability coefficients (cf. Brennan, 1980;
Brennan & Kane, 1977a,b; Kane & Brennan, 1980). Brennan and Kane
(1977a)  proposed  a  coefficient  which  paralleled  Livingston's  (1972a) 
coefficient  developed  within  classical  test  theory.  The  reasoning  went 
as  follows.  In  CRM,  we  are  interested  in  the  difference  between  a  per¬ 
son's  universe  score  in  a  well-defined  behavioral  domain  and  some  cri¬ 
terion  in  that  domain.  In  estimating  a  person's  universe  score  from 
his  observed  score,  the  error  is 


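$$\Delta_{pi} = (X_{pi} - \lambda) - (\mu_p - \lambda) = X_{pi} - \mu_p,$$

where $\lambda$ is the criterion. From this formulation, their index of
dependability for a domain-referenced test, $\Phi(\lambda)$, is:

$$\Phi(\lambda) = \frac{\sigma^2_p + (\mu - \lambda)^2}{\sigma^2_p + (\mu - \lambda)^2 + \sigma^2_\Delta} \qquad [17]$$

In the special case where $\lambda = \mu$, the index of dependability is
equal to the G coefficient for absolute decisions:

$$\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta} \qquad [18]$$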


Brennan (1980) characterized $\Phi(\lambda)$ as follows: (a) It uses
the error associated with absolute decisions ($\sigma^2_\Delta$) rather
than the error for relative decisions ($\sigma^2_\delta$). (b) It varies
from 0 to 1. (c) It varies with different values of $\lambda$, i.e., the
numerator depends on the universe score variance and the squared
distance of the population mean from the criterion. (For a critique of
this characteristic, see Harris, 1972; Linn, 1979; Shavelson, Block &
Ravitch, 1972; but see also Kane & Brennan, 1980; Livingston, 1972b.)
And (d) $\Phi(\lambda)$ can be positive with zero universe-score
variance. Brennan (1979b, pp. 27-28; see also Brennan, 1980; Kane &
Brennan, 1980) distinguished the interpretation of $\Phi(\lambda)$ from
that of $\Phi$ as follows: "$\Phi(\lambda)$ provides an estimate of the
dependability of the decisions based on the testing procedure
[including chance agreement in scores]; $\Phi$ provides an estimate of
the contribution of the testing procedure to the dependability of such
decisions."
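
Characteristics (c) and (d) are easy to check numerically. The sketch below assumes the form of the index usually attributed to Brennan and Kane (1977a): universe-score variance plus the squared distance between the population mean and the criterion, divided by that same quantity plus the absolute error variance. Population values are taken as known, so no correction for estimating the mean is applied. The function names and all numerical values are hypothetical.

```python
def phi_lambda(var_p, var_abs_error, mu, lam):
    """Index of dependability for criterion lam (population values known)."""
    signal = var_p + (mu - lam) ** 2
    return signal / (signal + var_abs_error)


def phi(var_p, var_abs_error):
    """Special case lam == mu: the G coefficient for absolute decisions."""
    return phi_lambda(var_p, var_abs_error, mu=0.0, lam=0.0)


# (c) the index grows as the criterion moves away from the population mean.
for lam in (0.7, 0.8, 0.9):
    print(lam, phi_lambda(var_p=0.03, var_abs_error=0.02, mu=0.7, lam=lam))

# (d) with zero universe-score variance the index can still be positive.
print(phi_lambda(var_p=0.0, var_abs_error=0.02, mu=0.6, lam=0.8))
print(phi(var_p=0.0, var_abs_error=0.02))
```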

In  generalizability  theory,  we  are  primarily  interested  in  vari¬ 
ance  components  rather  than  generalizability  coefficients.  And  esti¬ 
mates  of  variance  components  are  not  changed  by  introducing  a  criterion 
for  the  purpose  of  estimating  an  index  of  dependability  (Linn,  1979). 

If, in CRM, interest attaches to a mastery-nonmastery decision,
as implied by $\Phi(\lambda)$, a coefficient of generalizability seems
less important than an estimate of the probabilities of false positive
($\alpha$) and false negative ($\beta$) decisions. The distinction
between $\alpha$ and $\beta$ is important because the seriousness of
each may not be the same, which, in

Wilcox's (1977, 1979) work provides a starting point for dealing
with this situation. Assume that $n_i$ items are randomly sampled from a
skill domain. They are administered to an examinee in order to
determine whether his true score, $\mu_p$, is above or below the known
criterion score $\lambda$. If $\mu_p \geq \lambda$, the examinee is a
master. However, the decision about $\mu_p \geq \lambda$ is made if and
only if the examinee's observed score, $X$, is greater than or equal to
$X_o$, the "operational criterion" for deciding mastery. Note that the
choice of $X_o$ may incorporate the decisionmaker's notion of the losses
associated with $\alpha$ and $\beta$ (cf. Wilcox, 1979, p. 60) and so
$X_o$ may not equal $\lambda$. In order to estimate $\alpha$ and
$\beta$, Wilcox (1977) used an empirical Bayes approach assuming either
a beta distribution or a normal distribution (on transformed X's) for
true scores and a binomial probability function of observed scores
given true scores. Noting problems with beta and normal priors, Wilcox
(1979) worked out upper and lower bounds for $\alpha$ and $\beta$ which
make no assumptions about the form of the true score distribution.
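
To make the two error probabilities concrete, the sketch below computes them for a true-score distribution that is assumed to be known and discrete. This is not Wilcox's empirical Bayes estimator, nor his distribution-free bounds. It only evaluates the joint probabilities of a false positive and a false negative under a binomial error model, with true scores expressed as proportions correct and the operational criterion expressed as a number of items correct. All names and values are hypothetical.

```python
from math import comb


def binom_tail(n, p, x_min):
    """P(X >= x_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(x_min, n + 1))


def error_rates(true_scores, weights, n_items, lam, x_o):
    """Joint probabilities (alpha, beta) of false positive and false
    negative mastery decisions.

    true_scores: possible true proportions correct
    weights:     their probabilities (summing to 1)
    lam:         criterion on the true-score scale
    x_o:         operational criterion (minimum number correct to pass)
    """
    alpha = beta = 0.0
    for mu_p, w in zip(true_scores, weights):
        pass_prob = binom_tail(n_items, mu_p, x_o)
        if mu_p < lam:
            alpha += w * pass_prob           # nonmaster declared a master
        else:
            beta += w * (1.0 - pass_prob)    # master declared a nonmaster
    return alpha, beta


# Half the examinees at 0.5, half at 0.9; 10 items; criterion 0.8.
for x_o in (7, 8, 9):
    print(x_o, error_rates([0.5, 0.9], [0.5, 0.5], 10, lam=0.8, x_o=x_o))
```

Raising the operational criterion lowers the false positive probability and raises the false negative probability, which is why the choice of $X_o$ can reflect the relative losses attached to the two errors.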

2.4.  Symmetry 

The  purpose  of  psychological  measurement  has  typically  been  to 
differentiate  individuals.  The  focus  on  individuals  is  reflected  in 
Cronbach  et  al.'s  mentioning  only  the  case  of  measuring  attributes  of 
schools  or  classrooms  (teachers)  as  an  alternative  to  measuring  attri¬ 
butes  of  individuals  within  them.  Unlike  the  traditional  concentration 
on  individuals,  however,  Cardinet,  Tourneur  and  Allal  (1976a, b,  in 
press;  Cardinet  &  Tourneur,  1974,  1977;  Tourneur,  1978;  Tourneur  & 
Cardinet,  1979)  recognized  that  the  focus  of  measurement  may  change 
depending  on  a  particular  decisionmaker's  purpose.  "One  can  easily 
cite  cases,  particularly  in  educational  research,  where  the  purpose  of 
measurement  is  to  compare  the  rates  of  success  for  different  test  items, 
or  for  different  instructional  treatments"  (Cardinet  et  al.,  in  press, 
p.  6;  cf.  Wood,  1976a).  Individual  differences,  then,  may  represent 
a  source  of  error — rather  than  universe  score — variation  in  the  measure¬ 
ment. 

Cardinet and his colleagues speak of a principle of symmetry: "The
principle of symmetry of the data is simply an affirmation that each of
the facets of a factorial design can be selected as an object of study,
and that the operations defined for one facet can be transposed in the
study of another facet" (Cardinet et al., in press, p. 7).

The  principle  of  symmetry  led  Cardinet  et  al.  (in  press)  to  dis¬ 
tinguish  between  four  stages  of  a  measurement  study:  (a)  the  obser¬ 
vation  design,  (b)  the  estimation  design,  (c)  the  measurement  design, 
and  (d)  the  optimization  design.  By  distinguishing  these  four  stages, 
Cardinet  et  al.  (in  press)  were  able  to  disentangle  measurement  con¬ 
siderations  (e.g.,  specification  of  the  object  of  measurement)  from  the 
computations  that  yield  estimates  of  variance  components.  "Until  the 
two  kinds  of  problems  (ANOVA  estimates,  measurement)  were  clearly  dis¬ 
entangled,  no  multi-purpose  measurement  was  possible"  (Cardinet,  per¬ 
sonal  communication,  May  9,  1980). 

The  first  stage — observation — includes  the  choice  of  facets  and 
conditions  and  computation  of  mean  squares.  The  second  stage — esti¬ 
mation — involves  the  decision  about  whether  the  facets  are  finite  or 
infinite  and  random  or  fixed  (cf.  Wood,  1976a,  for  an  application  of 
finite,  random  facets),  and  the  estimation  of  variance  components. 

The  third  stage — measurement — specifies  which  facet  (or  combination 
of  facets,  see  below)  is  the  focus  of  measurement  and  which  facets  may 
limit  the  generalization  of  the  measurement  (i.e.,  sources  of  error). 
Estimates of error ($\sigma^2_\delta$, $\sigma^2_\Delta$) and
generalizability ($E\rho^2$) are obtained in this stage. The fourth
stage provides information relevant to alternative D-study designs.

As  an  example  of  the  four  stages  in  a  study,  consider  a  G  study 
of  student  evaluations  of  teaching.  An  observation  design  might  have 
classrooms  (c),  students  nested  within  classrooms  (s:c),  and  items  (i) 
crossed  with  both  classrooms  and  students  (see  Smith,  1979,  for  a  dis¬ 
cussion  of  designs  for  student  ratings).  For  simplicity,  assume  that 
the  same  number  of  students  is  observed  in  each  classroom  (but  see 
Section  2.1.5). 

In  the  estimation  design,  a  decision  is  made  about  whether  the 
model  is  random  or  fixed  (see  Section  2.2  for  a  discussion  of  fixed 
facets).  The  variance  components,  then,  are  estimated  according  to 
the  appropriate  model. 

In the measurement design, the focus of measurement (classrooms)
is identified along with sources of error variation (students and
items) and $\sigma^2_\delta$ or $\sigma^2_\Delta$ and $E\rho^2$ are
estimated (see Smith, 1979). Kane (& Brennan, 1977; Kane et al., 1976;
see also Gillmore et al., 1978; Smith, 1979) pointed out that the
magnitude of the generalizability coefficient depends upon whether the
item facet is considered fixed or random. While the expected observed
score variance remains the same,

$$\sigma^2_c + \sigma^2_{ci}/n_i + \sigma^2_{(s,s:c)}/n_s + \sigma^2_{res}/n_i n_s,$$

the universe score variance is $\sigma^2_c + \sigma^2_{ci}/n_i$ when
items are fixed (and $\sigma^2_c$ alone when items are random). Kane and
Brennan (1977; Kane et al., 1976) also discussed the case where the
student facet is fixed and the item facet is random, i.e., the case of
a nested universe. Finally, Kane (et al.,
1976;  see  also  Kane  &  Brennan,  1977)  pointed  out  that  the  instructor 
effect  is  confounded  with  course  content,  differential  selection  of 
students  into  classes,  observation  occasion,  and  so  on.  This  confound¬ 
ing  will  inflate  the  variance  attributed  to  instructor  to  an  unknown 
extent.  Gillmore  (1980;  et  al.,  1978)  and  Smith  (1979)  presented  de¬ 
signs  (and  data)  that  reduce  this  confounding. 
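
The dependence of the coefficient for classroom means on the status of the item facet can be sketched as follows. The function assumes the (s:c) x i design described above, with expected observed-score variance made up of the class component, the class-by-item component divided by $n_i$, the students-within-class component divided by $n_s$, and the residual divided by $n_i n_s$. The function name, its `items_fixed` switch, and the variance components are hypothetical.

```python
def class_mean_g(var_c, var_ci, var_sc, var_res, n_i, n_s, items_fixed=False):
    """G coefficient for classroom means in an (s:c) x i design."""
    observed = var_c + var_ci / n_i + var_sc / n_s + var_res / (n_i * n_s)
    # With items fixed, the class-by-item term belongs to the universe score.
    universe = var_c + (var_ci / n_i if items_fixed else 0.0)
    return universe / observed


comps = dict(var_c=2.0, var_ci=1.0, var_sc=6.0, var_res=12.0)
print(class_mean_g(n_i=10, n_s=20, **comps))                     # items random
print(class_mean_g(n_i=10, n_s=20, items_fixed=True, **comps))   # items fixed
print(class_mean_g(n_i=10, n_s=40, **comps))                     # more students
```

The denominator is identical in the first two calls, so the coefficient is larger when items are fixed. The third call shows the coefficient rising as more students are sampled per class.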

Incidentally,  Kane  and  Brennan  (1977)  related  generalizability 
theory's  approach  to  estimating  the  reliability  of  classroom  means  to 
approaches  proposed  in  classical  theory.  They  showed  that  classical 
theory  approaches  treated  one  facet  as  random  while  the  other  facet 
was  implicitly  treated  as  fixed. 

Finally,  the  fourth  stage  of  the  measurement  study  would  consider 
alternative  sample  sizes  for  students,  items  or  both  (depending  on 
whether  the  model  is  random  or  mixed).  It  would  also  take  into  account 
the  possibility  of  nesting  items  within  students  (see  Cronbach  et  al., 
1972,  on  matrix  sampling  studies,  p.  214ff). 

The  principle  of  symmetry  leads  to  the  possibility  of  multifaceted 
populations.  Cardinet  &  Tourneur  (1977;  with  Allal,  1976  (a,b),  in 
press)  noted  that,  in  surveys  of  educational  achievement,  the  focus  of 
measurement  is  on  activities  (objectives),  years  or  schools,  and  not 
on  students.  In  their  example,  the  survey  might  have  focused  on  the 
attainment  of  educational  objectives  (o)  nested  in  content  units  (o:c) 
and crossed with students (s). The universe score of interest, then,
is that of (say) objectives nested within content units (o:c) while
the sampling of students gives rise to errors of measurement. [A
variety of such designs is given in Cardinet, Tourneur & Allal (1976;
in press).]

The  notions  of  symmetry  and  multifaceted  populations  lead  to 
several  important  consequences.  In  the  (o:c)  x  s  design  described 
above,  for  example,  an  increase  in  the  generalizability  coefficient 
would  be  obtained  by  increasing  sample  size  (e.g.,  number  of  students). 
This is just what would happen in sampling theory by specifying power,
$\alpha$, a difference in means to be detected, and then calculating n. In
short,  both  approaches  lead  to  a  decrease  in  the  standard  error  of  the 
mean.  This  is  one  point  where  the  specialized  area  of  measurement 
theory  meets  the  more  general  area  of  sampling  theory's  estimation  of 
differences  between  means. 

A  second  consequence  of  symmetry  and  multifaceted  populations  is 
that  the  decisionmaker  is  able  to  systematically  examine  the  assumption 
that measures are taken on a sample of persons from a homogeneous
population. For example, if a population consists of subjects nested
within sex and socioeconomic status (SES), variance components can be
estimated for each facet. If the variance components for sex and SES are
negligible for the particular attribute being measured, the
decisionmaker can assume a homogeneous population and so reduce the
design of the D study. If the components are sizeable, the decisionmaker
may calculate separate estimates for each subgroup.

One final contribution of the principle of symmetry to be mentioned
here is that it has led Cardinet and his colleagues to consider the
case of a fixed facet comprising the focus of measurement. For example,
evaluation  of  teachers  in  a  school  system  might  involve  observers 
periodically  observing  the  teachers.  In  this  case,  estimates  of  dif¬ 
ferences  between  a  fixed  set  of  teachers  or,  more  appropriately,  esti¬ 
mates  of  their  universe  scores  might  be  the  focus  of  measurement.  Like¬ 
wise,  in  industrial  settings  where  supervisors'  ratings  of  employees 
are  gathered,  one  might  consider  employees  as  a  fixed  facet  within 
some  period  of  time. 

2.5. Multivariate Generalizability

Educational  and  psychological  measurements  often  involve  multiple 
scores  describing  individuals'  aptitudes  or  skills.  For  example,  com¬ 
posites  of  CTBS  subtest  scores  are  used  in  classifying  children  for 
educational  programs;  and  university  instructors  in  science  laborator¬ 
ies  look  for  proficiency  in  students'  manipulative,  observational, 
interpretative,  and  planning  skills  (cf.  Wood,  1976).  Although  multi¬ 
ple  scores  may  be  conceived  as  vectors,  and  thus  should  be  treated 
simultaneously,  the  majority  of  generalizability  and  decision  studies 
have  not  done  so.  Rather,  each  variable  has  been  treated  separately. 

One  reason  is  the  paucity  of  theory  and  procedures  for  examining  multi¬ 
ple  outcomes.  Another  reason  is  that  the  multivariate  literature  is 
not  always  easily  comprehended.  In  an  attempt  to  remedy  this  situation, 
we  outline  Cronbach  et  al.'s  (1972)  contributions  more  extensively  here 
than  in  other  parts  of  the  review.  We  then  describe  further  develop¬ 
ments  in  multivariate  generalizability  and  point  to  problems  still  in 
need  of  attention. 

2.5.1.  Background 

In  extending  the  notion  of  multifaceted  error  variance  to  multi¬ 
variate  designs,  Cronbach  et  al.  (1972)  stressed  the  separate  treat¬ 
ment  of  the  scores  rather  than  the  use  of  a  composite  of  them.  This 
permits  the  decisionmaker  to  examine  variances  of  and  covariances 
among  the  variables,  and  to  formulate  an  optimal  D-study  design.  As 
was  the  case  in  the  development  of  univariate  G  theory,  Cronbach  et 
al.  focused  on  methods  of  obtaining  and  interpreting  variance  com¬ 
ponents.  Multivariate  G  theory  decomposes  both  variances  and  covar¬ 
iances  into  components,  whereas  univariate  G  theory  examines  only 
components  of  variance.  The  expected  mean  product  equations  are  solved 
in analogous fashion to their univariate counterparts (for an elementary
exposition, see Travers, 1969). For example, the decomposition
of  the  variance-covariance  matrix  in  a  one-facet,  crossed  design  with 
two  dependent  variables  is: 

The symbols in [19] are defined as follows:

${}_1X_{pi}$ = score on variable 1 for person p observed under
condition i,

${}_2X_{pg}$ = score on variable 2 for person p observed under
condition g, and

${}_1p$ = abbreviation for ${}_1\mu_p$: the universe score on
variable 1 for person p.
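
Decomposition [19], to which these definitions refer, can be reconstructed as follows. This is a sketch consistent with the definitions given here and with the later statement that the off-diagonal elements of the last two matrices vanish under independent sampling; the exact layout of the original equation is an assumption.

```latex
\begin{bmatrix}
\sigma^2({}_1X_{pi}) & \sigma({}_1X_{pi},{}_2X_{pg}) \\
\sigma({}_1X_{pi},{}_2X_{pg}) & \sigma^2({}_2X_{pg})
\end{bmatrix}
=
\begin{bmatrix}
\sigma^2({}_1p) & \sigma({}_1p,{}_2p) \\
\sigma({}_1p,{}_2p) & \sigma^2({}_2p)
\end{bmatrix}
+
\begin{bmatrix}
\sigma^2({}_1i) & \sigma({}_1i,{}_2g) \\
\sigma({}_1i,{}_2g) & \sigma^2({}_2g)
\end{bmatrix}
+
\begin{bmatrix}
\sigma^2({}_1pi,e) & \sigma({}_1pi,e,\ {}_2pg,e) \\
\sigma({}_1pi,e,\ {}_2pg,e) & \sigma^2({}_2pg,e)
\end{bmatrix}
```

The three matrices on the right correspond to persons (universe scores), conditions, and the residual, in that order.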


In [19], the term $\sigma({}_1p,{}_2p)$ is the covariance between universe
scores on variables 1 and 2, say, ratings on two aspects of writing:
organization and coherence. The term $\sigma({}_1i,{}_2g)$ is the covariance
between scores on the two variables due to the conditions of observation.
Facet i may be the same as facet g, for example, when the same essay
is used to obtain ratings of organization and coherence.

An  important  aspect  of  the  development  is  the  distinction  between 
linked  and  unlinked  conditions.  When  conditions  for  observing  differ¬ 
ent  variables  are  selected  independently,  the  expected  values  of  all 
error components of covariance are zero. With independent sampling,
then, the expected values of the off-diagonal elements in the last
two matrices of [19] are zero. For example, in the illustration above,
if the essays used to obtain ratings of organization are selected
independently of the essays used to obtain ratings of coherence, then
the expected value of the component of covariance $\sigma({}_1i,{}_2g)$
is zero.

When  conditions  for  observing  multiple  outcomes  are  not  selected 
independently,  but  are  jointly  sampled,  the  expected  values  of  all 
components of covariance are non-zero. Cronbach et al. (1972) presented
the following example of a non-zero component of covariance for
conditions.

Suppose that the design calls for teacher i to rate pupils
p on both ability $v_1$ and motivation $v_2$. Some teachers give
higher ratings on the average than other teachers do; the i
component represents this bias. The constant errors in $v_1$
ratings are likely to covary (over teachers) with the constant
errors in $v_2$ ratings. The covariance $\bullet\sigma({}_1i,{}_2i)$ then
would be positive. [Page 277; following Cronbach et al.,
the bullet symbol ($\bullet$) indicates linkage.]

The  literature  reviewed  below,  while  acknowledging  the  possibility  of 
linked  conditions,  addresses  only  the  unlinked  case. 
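
The mean-product solution described above can be sketched for a one-facet, crossed (p x i) design in which both variables are observed under the same (linked) conditions. This is a minimal illustration rather than the procedure of Cronbach et al.; the function name and data layout are assumptions, and the estimators simply mirror the univariate expected-mean-square equations.

```python
def covariance_components(x1, x2):
    """Estimate covariance components for a p x i design with linked conditions.

    x1 and x2 are lists of rows (one row per person, one column per
    condition) holding scores on variables 1 and 2. Passing the same
    table twice returns the ordinary univariate variance components.
    """
    n_p, n_i = len(x1), len(x1[0])

    def grand_mean(x):
        return sum(sum(row) for row in x) / (n_p * n_i)

    def person_means(x):
        return [sum(row) / n_i for row in x]

    def condition_means(x):
        return [sum(row[i] for row in x) / n_p for i in range(n_i)]

    g1, g2 = grand_mean(x1), grand_mean(x2)

    # Sums of cross products for persons, conditions, and the total.
    sp_p = n_i * sum((a - g1) * (b - g2)
                     for a, b in zip(person_means(x1), person_means(x2)))
    sp_i = n_p * sum((a - g1) * (b - g2)
                     for a, b in zip(condition_means(x1), condition_means(x2)))
    sp_total = sum((x1[p][i] - g1) * (x2[p][i] - g2)
                   for p in range(n_p) for i in range(n_i))

    # Mean products; the residual is obtained by subtraction.
    mp_p = sp_p / (n_p - 1)
    mp_i = sp_i / (n_i - 1)
    mp_res = (sp_total - sp_p - sp_i) / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-product equations:
    #   E(MP_p) = c(pi,e) + n_i * c(p);  E(MP_i) = c(pi,e) + n_p * c(i)
    return {
        "p": (mp_p - mp_res) / n_i,
        "i": (mp_i - mp_res) / n_p,
        "pi,e": mp_res,
    }
```

With unlinked conditions the cross products for conditions and the residual are not defined cell by cell, so only the person component would be estimated, from the person means.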

In their discussion and illustrations of multivariate generalizability
analysis, Cronbach et al. (1972) did not develop a multivariate
generalizability coefficient, but focused almost entirely on
the interpretation of components of variance and covariance. They
examined components to rule out "distressing" counterhypotheses. In
an analysis of verbal and performance scores from the WISC and WAIS,
for example, test forms were linked because verbal and performance
scores were observed on the same form (WISC or WAIS) on the same day.
However, the small variance and covariance components reported for
forms indicate that linkage is not problematic here.

Travers (1969) developed a correction for attenuation analogous
to Spearman's classical formula, $r_{T_X T_Y} = r_{XY} / \sqrt{r_{XX'}\, r_{YY'}}$.
Travers (1969, p. 344ff.; see also Cronbach et al., 1972, p. 287) showed
that Spearman's formula can be restated as

[20]  $\rho({}_1\mu_p, {}_2\mu_p) = \dfrac{\sigma({}_1\mu_p, {}_2\mu_p)}{\sqrt{\sigma^2({}_1\mu_p)\,\sigma^2({}_2\mu_p)}}$

Whether the expected covariance between observed scores equals the
covariance between universe scores depends upon linkages among
conditions of facets in the design. In a two-facet design, for example,
when i and g are independently sampled, the expected covariance of
${}_1X_{pi}$ with ${}_2X_{pg}$ is $\sigma({}_1p,{}_2p)$. When i and g are
linked, however, the expected covariance is
$\sigma({}_1p,{}_2p) + \sigma({}_1pi,e,\ {}_2pg,e)$. When the expected
covariance between observed scores is used as an estimate of the
covariance between universe scores, then, the corrected correlation
obtained with joint sampling will tend to be higher than that obtained
with independent sampling, although the effect decreases as the number
of levels of the i-facet increases.
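
The correction for attenuation in its two forms can be sketched as below. The function names and the numerical values in the usage note are illustrative assumptions; the first function works from estimated universe-score (co)variance components, the second from observed correlations and reliabilities in the classical manner.

```python
import math


def universe_score_correlation(cov_12, var_1, var_2):
    """Correlation between universe scores from (co)variance components."""
    if var_1 <= 0 or var_2 <= 0:
        raise ValueError("universe-score variances must be positive")
    return cov_12 / math.sqrt(var_1 * var_2)


def spearman_correction(r_xy, r_xx, r_yy):
    """Classical correction: observed correlation over root reliabilities."""
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)
```

For example, `spearman_correction(0.4, 0.8, 0.5)` divides the observed correlation by the square root of 0.4. If the covariance passed to the first function comes from jointly sampled conditions, it includes the linked residual covariance, and the corrected correlation is inflated accordingly.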

Cronbach  et  al.  (1972)  also  developed  a  multivariate  predictor 
of  the  universe  score.  In  the  univariate  case,  the  universe  score 
is  estimated  from  the  regression  of  the  universe  score  on  observed 
scores  (p.  103): 


[21]  $\hat{\mu}_p = (\mathcal{E}\rho^2)\,X_{pI} + (1 - \mathcal{E}\rho^2)\,X_{PI}$

In  the  multivariate  case,  the  regression  equation  for  a  particular 
dependent  variable  includes  not  only  the  observed  scores  on  that 
variable,  but  also  observed  scores  for  all  other  variables  in  the 
set.  The  multiple  regression  coefficients  are  estimated  using  linked 
or  unlinked  covariances,  depending  upon  the  anticipated  design  of  the 
D  study. 
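
The univariate regression estimate can be sketched as a weighted average of a person's observed score and the group mean, with the generalizability coefficient as the weight. This is a minimal illustration; the function and argument names are assumptions.

```python
def estimate_universe_score(observed, group_mean, g_coefficient):
    """Regress an observed score toward the group mean.

    g_coefficient is the generalizability coefficient for the D-study
    design; values near 1 leave the observed score almost unchanged.
    """
    if not 0.0 <= g_coefficient <= 1.0:
        raise ValueError("g_coefficient must lie between 0 and 1")
    return g_coefficient * observed + (1.0 - g_coefficient) * group_mean
```

With an observed score of 80, a group mean of 60, and a coefficient of .75, the estimate is 75. The multivariate predictor replaces the single weight with multiple regression weights on all variables in the set.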

The  set  of  multiple  regression  equations  produces  a  profile  of 
estimated  universe  scores  for  each  person.  This  profile  is  more 
reliable  (and  usually  flatter)  than  that  based  on  univariate  regression 
equations. In an example using data from the Differential Aptitude
Tests (DAT), Cronbach et al. (1972) reported reductions in error
variance as large as 42% when all subtests were used as predictors,
compared to error variances from single predictors. (They added, however,
that the number of predictors used in each equation should be guided
by the sample size, n. See Darlington, 1978, and Fyans, 1978, for
regression  procedures  that  yield  reduced  sampling  errors  of  regression 
estimates  when  the  number  of  predictors  is  large  relative  to  the  sample 
size.)  The  important  finding  for  counseling  and  research  is  that  ob¬ 
served  profiles  and  those  estimated  from  univariate  regressions  may  be 
much  farther  from  the  true  profiles  than  multivariate  estimates. 

Surprisingly little has been published on multivariate generalizability
theory in recent years. A multivariate generalizability
coefficient has been developed for the limited case where the
decisionmaker simply wants to maximize the generalizability of a
composite. The following sections discuss this multivariate G
coefficient, the interpretation of canonical variates in multivariate
analyses, and the choice between univariate and multivariate analyses.

2.5.2.  Multivariate  generalizability  coefficient 

Bock  (1963,  1966;  see  also  Haggard,  1958)  and  Conger  and  Lipshitz 
(1973;  Conger,  1974)  developed  multivariate  analogues  of  test  reli¬ 
ability  for  one-facet  designs.  From  a  random-effects,  multivariate 
analysis  of  variance  of  standardized  scores,  the  multiple  discriminant 
functions  are  determined  so  as  to  maximize  the  ratio  of  between-person 
variation  to  within-person  variation.  Since  the  Bock  and  Conger  and 
Lipshitz  coefficients  do  not  differentiate  between  different  sources 
of  error  variance,  they  have  limited  utility  for  the  design  of  decision 
studies. 

The  only  multivariate  reliability  coefficient  anchored  in  general¬ 
izability  theory  was  developed  by  Joe  and  Woodward  (1976).  Their 
approach  distinguished  between  G  and  D  studies  and  generalized  the  work 
of  Bock  and  Conger  and  Lipshitz  to  a  variety  of  multifaceted  designs 
with  crossed  and  nested  facets. 

For the two-facet, fully crossed design, the multivariate coefficient
is


[22] 
where the component matrices are matrices of variance and covariance
components estimated from mean square matrices, $n'_i$ and $n'_j$ are
the number of conditions of facets i and j in a D study, and
$\mathbf{a}$ is the vector of canonical coefficients that maximizes the
ratio of between-person to between-person plus within-person variance
component matrices.


For  one-facet  designs  with  large  samples  and  one  condition  of  the 
facet,  Joe  and  Woodward's  coefficient  is  equivalent  to  the  coefficients 
developed  by  Bock  (1966)  and  Conger  and  Lipshitz  (1973;  see  also  Conger, 
1974).  The  value  of  Joe  and  Woodward’s  approach  is  that  it  allows  us 
to  maximize  a  generalizability  coefficient  by  assessing  the  magnitude 
of  different  sources  of  error  and  so  design  D  studies  that  reduce  the 
sources  of  large  error  variation. 
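
How the D-study sample sizes enter such a coefficient can be sketched as follows for a fixed weight vector. The form of the ratio used here is an assumption based on the usual relative-error structure of a two-facet crossed design (person-by-facet interaction matrices divided by the D-study numbers of conditions); it is not a transcription of [22], and all names and values are hypothetical.

```python
def quad_form(a, m):
    """Return a'Ma for a weight vector a and a square matrix m."""
    n = len(a)
    return sum(a[r] * m[r][c] * a[c] for r in range(n) for c in range(n))


def composite_g_coefficient(a, v_p, v_pi, v_pj, v_pij_e, n_i, n_j):
    """Generalizability of the composite a'X in a p x i x j design.

    v_p is the universe-score component matrix; the other matrices are
    the person-by-facet interaction and residual component matrices.
    n_i and n_j are the numbers of conditions planned for the D study.
    """
    universe = quad_form(a, v_p)
    error = (quad_form(a, v_pi) / n_i
             + quad_form(a, v_pj) / n_j
             + quad_form(a, v_pij_e) / (n_i * n_j))
    return universe / (universe + error)


# Hypothetical component matrices for two variables.
a = [1.0, 1.0]
v_p = [[1.0, 0.5], [0.5, 1.0]]
v_pi = [[0.4, 0.0], [0.0, 0.4]]
v_pj = [[0.2, 0.0], [0.0, 0.2]]
v_pij_e = [[0.8, 0.0], [0.0, 0.8]]

one_each = composite_g_coefficient(a, v_p, v_pi, v_pj, v_pij_e, 1, 1)
larger = composite_g_coefficient(a, v_p, v_pi, v_pj, v_pij_e, 4, 2)
```

Comparing `one_each` with `larger` shows which facet is worth sampling more heavily; the canonical weights themselves would come from maximizing this ratio over `a`, which the sketch does not attempt.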

One limitation of all of the above approaches arises when variance
component matrices are not positive definite or positive semidefinite.
Joe and Woodward (1976) recommended using "extreme caution" and suggested
that  negative  definite  matrices  should  not  be  used  in  the  estimation  of 
variance  component  matrices  and  generalizability  coefficients.  As  ex¬ 
pected,  the  problem  with  negative  estimates  of  variance  components  in 
univariate  generalizability  extends  to  the  multivariate  case;  solutions 
need  to  be  worked  out  in  both  arenas  (see  Section  2.1). 


2.5.3.  Interpretation  of  canonical  variates 

There is a set of canonical coefficients ($\mathbf{a}_s$) for each
characteristic root ($\lambda_s$) in [22]. Each set of canonical
coefficients defines a composite of the scores. By definition, the
first composite is the most reliable while the last composite is the
least reliable.

Attention should be paid to the interpretation of these composites
(see, for example, Fyans, Salili, Maehr, & Desai, 1980; Harnqvist,
1973; Peng & Farr, 1976). Conger and Lipshitz (1973), for example,
examined  the  canonical  coefficients  in  an  illustrative  analysis  of 
data  from  the  WISC  to  provide  further  information  about  common  diagnos¬ 
tic  interpretations  of  differences  among  subscales.  Since  all  subtests 
had  positive  weights  on  the  first,  most  reliable  composite,  they  in¬ 
terpreted  this  finding  as  providing  support  for  the  use  of  the  total 
IQ  score.  The  second  most  reliable  composite  was  not  the  expected 
contrast  between  verbal  and  performance  IQ.  Rather,  this  composite 
was  a  contrast  between  the  verbal  subtests  and  the  subtests  of  Block 
Design,  Object  Assembly,  and  Mazes.  This  contrast  was  more  reliable 
than  the  verbal-performance  contrast.  The  remaining  canonical  variates 
also  provided  unexpected  contrasts  among  subtests. 

In using the multivariate G coefficient, the data, not the
investigators, define the composites of maximum generalizability. This
empirically derived coefficient may not correspond to the way
composites are defined by theory (e.g., theory of human abilities) or
practice (interpretation of subtests for classification of applicants).
We would prefer instead to estimate the generalizability of a composite
given a set of constraints. Two issues are involved: determining the
weights and estimating the generalizability of the composite.

The  weights  can  be  determined  using  psychological  theory  or  prac¬ 
tical  application.  The  weights  might  be  a  set  of  orthogonal  coeffi¬ 
cients.  For  example,  Harnqvist  (1973)  examined  canonical  coefficients 
in  an  analysis  of  data  from  the  Primary  Mental  Abilities  (PMA)  battery. 
He  could  have  used  orthogonal  coefficients  to  weight  the  four  ability 
subtests  in  his  analysis  to  form  a  verbal-numerical  contrast,  hypo¬ 
thesized  to  be  important  in  factor  theories  of  intelligence. 

A  second  method  of  obtaining  weights  conforming  to  theory  or 
practice  is  to  establish  the  weights  using  confirmatory  maximum-like¬ 
lihood  factor  analysis  (Joreskog,  1969).  In  an  illustration  of  this 
method,  Joreskog  analyzed  scores  of  nine  mental  ability  tests.  In 
one example solution, tests were hypothesized to measure three factors:
visualization, verbal intelligence, and speed. The first model
specified three factors with loadings reflecting the hypothesized
structure. The resulting statistic indicated a poor fit to the data. Joreskog
suggested  several  ways  to  determine  the  cause  of  the  poor  fit;  all  of 
them  involve  relaxing  one  or  more  restrictions  in  the  model  and  re¬ 
evaluating  the  fit. 

The problem with repeated testing of the model is exactly the
same problem that led us to seek alternatives to maximizing the ratio
of between-person to between-person plus within-person variation.
Namely, the changes in the model are determined by the data, not by
theory or practice. The investigator should stop modifying the
coefficients before the resulting composites become uninterpretable,
even if the fit to the data is poor.

These approaches to determining the weights of the variables in
composites may involve a tradeoff between interpretability and
generalizability. We emphasize interpretability; a composite that is
generalizable but not interpretable will not be of much use.

With respect to estimating the generalizability of the composite
formed either a priori or by an empirical test of goodness of fit,
the most straightforward approach is a univariate rather than a
multivariate analysis. The results of a univariate generalizability
analysis would be identical to those of a multivariate generalizability
analysis in which the weights of the composite define the $\mathbf{a}$
vector in [22].
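
The equivalence just stated can be checked numerically for one component. The sketch below, under the assumption of a one-facet crossed (p x i) design with hypothetical data, estimates the universe-score variance of a weighted composite directly and then again as a quadratic form in the person component matrix; the helper name and all values are illustrative.

```python
def person_component(x, y):
    """Person (co)variance component from mean products, p x i design."""
    n_p, n_i = len(x), len(x[0])
    gx = sum(map(sum, x)) / (n_p * n_i)
    gy = sum(map(sum, y)) / (n_p * n_i)
    px = [sum(row) / n_i for row in x]
    py = [sum(row) / n_i for row in y]
    ix = [sum(row[i] for row in x) / n_p for i in range(n_i)]
    iy = [sum(row[i] for row in y) / n_p for i in range(n_i)]
    sp_p = n_i * sum((a - gx) * (b - gy) for a, b in zip(px, py))
    sp_i = n_p * sum((a - gx) * (b - gy) for a, b in zip(ix, iy))
    sp_t = sum((x[p][i] - gx) * (y[p][i] - gy)
               for p in range(n_p) for i in range(n_i))
    mp_p = sp_p / (n_p - 1)
    mp_res = (sp_t - sp_p - sp_i) / ((n_p - 1) * (n_i - 1))
    return (mp_p - mp_res) / n_i


# Hypothetical scores: 4 persons by 3 conditions on two variables.
x1 = [[3, 4, 5], [2, 2, 4], [5, 6, 6], [1, 3, 2]]
x2 = [[2, 3, 3], [4, 4, 5], [3, 5, 4], [1, 2, 2]]
a = (2.0, 1.0)  # a priori weights (illustrative)

# Univariate route: analyze the composite score directly.
composite = [[a[0] * u + a[1] * v for u, v in zip(r1, r2)]
             for r1, r2 in zip(x1, x2)]
direct = person_component(composite, composite)

# Multivariate route: quadratic form a'Va in the person component matrix.
v11 = person_component(x1, x1)
v22 = person_component(x2, x2)
v12 = person_component(x1, x2)
quadratic = a[0] ** 2 * v11 + 2 * a[0] * a[1] * v12 + a[1] ** 2 * v22
```

The two routes agree because the mean-product estimators are bilinear in the scores; the same argument applies to every other component, and hence to the coefficient.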

2.5.4. Relation between multivariate and univariate G theory

When  multiple  scores  are  conceived  of  as  composites,  a  multi¬ 
variate  generalizability  analysis  is  appropriate  because  it  explicitly 
takes  into  account  the  covariation  among  the  scores.  By  partitioning 
the  variance-covariance  matrix  for  observed  scores  into  matrices  of 
components  of  variance  and  covariance  for  universe  scores  and  error, 
the  investigator  can  identify  major  sources  of  error  variation  and 
covariation, essential information for designing an optimal D study.

In terms of a generalizability coefficient, if the decisionmaker's
interest is in obtaining a composite with maximum generalizability,
Joe and Woodward's generalizability coefficient is appropriate. If
the decisionmaker wants to assess the generalizability of a composite
defined by theory or practice, a univariate generalizability study of
the composite will produce the same results as the multivariate
analysis substituting the a priori weights into Joe and Woodward's
multivariate formula.

The  above  discussion  does  not  address  the  generalizability  of  a 
profile  of  scores.  In  a  profile,  interest  lies  in  the  pattern  of  scores, 
not  in  a  composite  of  them.  Cronbach  et  al.  (1972)  described  in  detail 
the  estimation  of  universe  scores  in  a  profile.  In  their  formulation, 
univariate  generalizability  coefficients  of  the  scores  in  the  profile 
serve  as  one  basis  for  judging  the  generalizability  of  individual 
scores  in  the  profile.  They  further  suggested  that  the  multivariate 
approach  of  Bock  (1966) — and  so  by  implication  that  of  Joe  and  Wood¬ 
ward — can  be  used  to  reduce  and  reorganize  the  profile.  The  multi¬ 
variate  generalizability  coefficients  may  show  that  some  combinations 
of  scores  are  measured  more  reliably  than  is  needed,  while  others  are 
not measured with sufficient precision for the decisionmaker's needs.
The decisionmaker can use this information to eliminate scores in the
profile or to design ways of obtaining more generalizable measures of
them.


2.6.  Sampling  in  Observational  Measurement 

The  ability  to  estimate  the  contribution  of  multiple  sources  of 
error  affecting  observational  measurements  is  one  of  the  major  con¬ 
tributions  of  G  theory.  However,  a  problem  still  to  be  resolved  is 
how  to  allocate  observations  taking  into  account  the  linkage  arising 
from  adjacent  observations. 

The  amount  of  observation  time  can  vary  on  two  dimensions:  (a) 
facets  that  affect  the  number  of  observations,  and  (b)  facets  that 
affect  the  length  of  observation  periods.  The  problem  with  developing 
procedures  for  estimating  reliability  for  different  numbers  and  lengths 
of  observations  is  that  observations  may  be  correlated  (called  linked 
in  the  previous  section;  see  also  Cronbach  &  Furby,  1970,  p.  69; 
Cronbach  et  al.,  1972,  p.  268ff).  That  measures  obtained  on  the  same 
day may not be independent is intuitively reasonable. Not as obvious,
perhaps,  is  the  linkage  among  measures  obtained  on  different  occasions. 
The  linkage,  as  in  any  time  series  problem,  is  a  matter  of  degree. 
Measures  of  teacher  behavior  obtained  at  different  times  during  a  class 
period,  for  example,  may  agree  more  than  measures  obtained  at  different 
times  during  a  school  day,  which  are,  in  turn,  likely  to  agree  more 
than  measures  obtained  on  different  days. 

Different  degrees  of  linkage  can  occur  even  within  a  short  span 
of  time,  as  is  illustrated  in  Table  3.  Webb  (1980)  observed  the  pro¬ 
portion  of  each  minute  that  a  student  worked  alone  without  communicating 
with  other  students  or  with  the  teacher — a  variable  commonly  observed. 
Behavior  of  junior  high  school  students  was  recorded  during  five-minute 
segments.  To  determine  whether  behavior  during  adjacent  one-minute 
intervals  within  a  segment  was  more  similar  than  behavior  from  intervals 
that  were  further  separated,  the  matrix  of  intercorrelations  among  the 
scores for one-minute intervals was calculated. The matrix exhibits a
simplex pattern: correlations are highest for adjacent intervals and
decline as the intervals grow farther apart. Thus, different degrees of
linkage are clearly evident, even within a time period as short as five
minutes.

Table 3

CORRELATIONS OF STUDENT BEHAVIOR ("WORKS ALONE")
CALCULATED FOR CONSECUTIVE ONE-MINUTE INTERVALS
(N = 50)

Minute      2      3      4      5

  1        .58    .43    .34   -.12
  2               .56    .32    .10
  3                      .52    .06
  4                             .23
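
One simple way to summarize the pattern in Table 3 is to average the correlations at each lag (number of minutes separating the intervals). The sketch below does this; the function name is an assumption, and the entry for minutes 1 and 2 is read as .58 from a poorly reproduced cell, so it should be treated as an assumption as well.

```python
# Upper-triangular correlations from Table 3, keyed by (minute, minute).
# The (1, 2) entry is an assumed reading of a garbled cell.
TABLE_3 = {
    (1, 2): 0.58, (1, 3): 0.43, (1, 4): 0.34, (1, 5): -0.12,
    (2, 3): 0.56, (2, 4): 0.32, (2, 5): 0.10,
    (3, 4): 0.52, (3, 5): 0.06,
    (4, 5): 0.23,
}


def mean_correlation_by_lag(corr):
    """Average the correlations that share the same separation in time."""
    by_lag = {}
    for (first, second), r in corr.items():
        by_lag.setdefault(second - first, []).append(r)
    return {lag: sum(rs) / len(rs) for lag, rs in sorted(by_lag.items())}


lag_means = mean_correlation_by_lag(TABLE_3)
```

The averages fall steadily from lag 1 to lag 4, which is the simplex pattern; the conclusion does not depend on the assumed (1, 2) entry, since the remaining lag-1 correlations already exceed the lag-2 average.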

The  simplex  pattern  just  described  extended  to  observations  made 
at  different  times  on  one  day  and  to  observations  made  on  different 
days. As expected, the correlations between observations drawn from
different five-minute segments of the same class hour were lower than
for  adjacent  one-minute  intervals.  Correlations  between  one-minute 
observations  on  different  days  were  lower  still. 

The  only  attempt  thus  far  to  estimate  the  effect  on  reliability 
of  varying  the  length  and  number  of  observation  periods  has  been  made 
by  Rowley  (1976,  1978).  Rowley  (1978)  described  the  effects  of  vary¬ 
ing  number  and  length  of  observations  separately  and  simultaneously. 
Unfortunately,  Rowley  treated  observation  periods  as  if  they  were  un¬ 
linked,  and  so  applied  the  Spearman-Brown  formula  inappropriately. 
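
For reference, the Spearman-Brown projection at issue can be written as below. The formula assumes that the k observation periods are interchangeable and that their errors are uncorrelated, which is exactly what linked observations violate; the function name is illustrative.

```python
def spearman_brown(r_single, k):
    """Projected reliability of the mean of k parallel, unlinked periods."""
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r_single / (1.0 + (k - 1.0) * r_single)
```

For a single-period reliability of .50, doubling the number of periods projects a reliability of about .67. When adjacent periods share correlated error, additional periods contribute less independent information than the formula supposes.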

The  more  general  questions  that  need  to  be  addressed  in  further 
research  are  (1)  how  can  the  correlation  between  observation  periods 
occurring  closer  or  further  in  time  be  taken  into  account,  and  (2) 
does  the  recognition  of  linked  conditions  of  a  facet  make  a  difference 
in  the  theory  or  in  the  analysis? 

A  potential  solution  is  the  following.  First,  consider  length 
as  a  vector  of  scores  corresponding  to  different  durations  of  obser¬ 
vation  time  or  consecutive  observation  intervals.  This  is  tantamount 
to  defining  the  universe  of  generalization  to  include  multiple  scores 
as  well  as  multiple  sources  of  measurement  error.  Thus,  in  Webb's 
(1980)  observational  study  described  above,  the  proportion  of  each 
minute  spent  working  alone  in  the  five-minute  observation  period  may 
be entered as a vector of five scores. Next, examine the
generalizability of observational measurements with a one-facet (number
of observations) multivariate analysis of variance (see Section 2.5).
This analysis would enable the decisionmaker to estimate the number
of  observations  needed  in  a  D  study  and  the  optimal  length  of  the 
observation  period  (the  first  canonical  variate  in  the  multivariate 
generalizability  analysis)  while  taking  into  account  the  correlations 
among  observation  intervals. 

Since  most  observational  measurement  is  linked  to  some  degree, 
a  full  treatment  of  this  topic  in  G  theory  is  clearly  needed.  Until 
the  theory  and  procedures  for  handling  linked  conditions  of  a  facet 
are  developed,  the  decisionmaker  should,  at  least,  make  sure  that 
the  time  samples  in  the  D  study  match  in  duration  those  in  the  G 
study. (For detailed recommendations, see Mitchell, 1979.) Where
observation periods in the G study differ in length or separation in
time from those used in the D study, variance components and
generalizability coefficients are likely to be overestimated or
underestimated according to some complex function of the magnitude and
direction of the correlations among measures from linked observation
periods.

In  this  section  we  have  assumed  that  the  phenomenon  being  studied 
remains  constant  over  observations.  If  this  assumption  holds,  then 
the  linkage  among  observations  is  due  to  correlated  error.  The  problem 
is  much  more  complex,  however,  when  the  universe  score  changes  over 
time,  as  is  the  case  in  maturation  studies  (e.g.,  Bayley,  1968). 

This  problem  is  too  large  to  be  reviewed  here.  Among  those  in¬ 
vestigating  time-dependent  phenomena  are  Bryk,  Strenio,  and  Weisberg 
(1980).  Although  they  have  not  investigated  reliability  explicitly, 
they  reviewed  traditional  analysis  strategies  used  in  the  face  of  non¬ 
equivalent  growth  systems  and  suggested  alternative  methods  of  analysis. 

2.7. Miscellaneous Topics

Here  we  mention  briefly  two  other  topics.  The  first  topic  is 
signal-to-noise  ratios  and  the  second  is  the  relationship  between  G 
theory  and  validity  theory. 

Signal/Noise Ratios. A signal/noise ratio is defined as the ratio
of the universe score variance (signal) to the error variance (noise).
It has been proposed by Kane and Brennan (1980) and Tavecchio (1977;
Elffers & Tavecchio, 1979) as a means for evaluating the adequacy
of a measurement procedure. Brennan and Kane (1977c) discussed this
ratio  for  absolute  decisions  while  Elffers  and  Tavecchio  (1979)  dis¬ 
cussed  it  for  relative  decisions.  We  mention  this  topic  in  passing 
because  we  prefer  to  de-emphasize  summary  coefficients  and  emphasize 
interpretation  of  the  components  of  error  variance  in  evaluating  a 
measurement  procedure. 
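
The connection between a signal/noise ratio and a coefficient of the generalizability type can be sketched as follows, assuming the same error variance is used in both. The function names are illustrative.

```python
def signal_to_noise(universe_variance, error_variance):
    """Ratio of universe-score variance to error variance."""
    if error_variance <= 0:
        raise ValueError("error variance must be positive")
    return universe_variance / error_variance


def coefficient_from_snr(snr):
    """Universe variance over universe-plus-error variance, from the ratio."""
    return snr / (1.0 + snr)
```

A ratio of 3, for instance, corresponds to a coefficient of .75. Both summaries carry the same information, which is one reason to look past them to the individual components of error variance.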

Relationship  between  G  Theory  and  Validity  Theory.  While  this 
topic  has  been  addressed  briefly  by  Cronbach  et  al.  (1972),  Cardinet 
et  al.  (in  press),  Fyans  (1977),  Guttman  and  Guttman  (1976),  McDonald 
(1978)  and  Van  der  Kamp  (1976),  a  systematic  attempt  to  integrate  G 
theory with validity theory has only recently been reported by Kane
(1980).  Kane's  treatment  is  provocative  and  elaborate,  but  too  tenta¬ 
tive  to  be  covered  in  this  review.  We  believe,  however,  that  his  formu¬ 
lation  will  set  the  stage  for  theoretical  developments  in  the  1980s. 

3. ILLUSTRATIVE APPLICATION OF GENERALIZABILITY THEORY


In  this  section,  ratings  of  the  educational  requirements  of 
occupations  are  used  to  illustrate  an  application  of  generalizability 
theory. The study (Webb & Shavelson, 1981; Webb, Shavelson, Shea,
& Morello, 1981) was conducted by the authors in conjunction with
the State of California Employment Department. Specifically, the
illustrations include a univariate generalizability analysis, a
multivariate generalizability analysis, estimation of variance components
with  an  unbalanced  design,  and  Bayesian  estimation  of  variance  com¬ 
ponents. 

3.1.  The  Study  of  General  Educational  Development  Ratings 

The  U.S.  Department  of  Labor  developed  the  General  Educational 
Development  (GED)  scale  to  rate  the  amount  of  reasoning,  mathematics, 
and  language  abilities  needed  to  perform  various  jobs.  GED  ratings 
are  used  in  several  employment  and  training  situations.  For  example, 
they  provide  the  basis  for:  (a)  estimates  of  time  required  to  learn 
job  skills,  (b)  state  employment  agencies’  decisions  to  refer  persons 
to  specific  employers,  job  training  programs,  or  remedial  education 
programs, and (c) equating jobs that have similar educational
requirements.

In  this  study,  job  analysts  were  given  written  descriptions  of 
jobs,  published  in  the  Dictionary  of  Occupational  Titles,  and  were 
asked  to  rate  the  jobs  on  three  components  of  the  GED  scale:  reason¬ 
ing  development,  mathematics  development,  and  language  development. 
Each  component  was  measured  on  a  six-point  scale.  Each  of  71  raters 
from  11  geographic  field  centers  across  the  U.S.  evaluated  the  three 
components  of  a  sample  of  jobs  on  two  occasions.  Different  centers 
had  different  numbers  of  job  analysts,  ranging  from  two  to  twelve. 
Hence,  the  G  study  design  was  a  partially  nested,  unbalanced  design 
with  different  numbers  of  raters  nested  within  centers.  In  order  to 
illustrate  G  theory  in  its  basic  form,  a  random  sample  of  two  raters 
from each center was taken to form a balanced generalizability design.

The design of the study, therefore, was raters (r) nested within
geographic centers (c), crossed with jobs (j) and occasions (o).
Because we are concerned with estimating the general educational
development required to perform each job, the variance component for
jobs ($\sigma^2_j$) is interpreted as the universe score variance. All
other variance components are considered measurement error in this
study since absolute decisions are made regarding the GED requirements
of each job. These include the components for raters nested within
center, center, occasion, and all interactions.


3.2. Univariate Generalizability Analysis

For  the  univariate  generalizability  study,  a  random  effects  analy¬ 
sis  of  variance  was  used  to  estimate  the  variance  components  contri¬ 
buting  to  the  observed  variation  in  job  ratings.  A  separate  analysis 
was  performed  for  each  component  of  the  GED  scale.  As  recommended  by 
Cronbach  et  al.  (1972),  all  negative  estimates  of  variance  components 
were  replaced  with  zero  in  calculating  the  variance  components.  For 
each analysis, the components of variance, the sum of components
constituting error variation, and the coefficient of generalizability
were computed.³

Since this analysis focuses on absolute decisions, the error
variance, $\sigma^2_\Delta$, reflects not only disagreements about the
ordering of the jobs but also differences in mean ratings. It is
important to know, for example, whether raters use essentially the same
mean level of the rating scale as well as whether they rank-order jobs
similarly.

Data bearing on the generalizability of the ratings of the jobs
over occasions, raters, and centers are reported in Table 4 for each
of the three GED ratings. The estimated variance components for jobs
differ  across  GED  ratings.  They  suggest  that  jobs  can  be  distinguished 
more  on  their  demands  for  language  than  on  their  demands  for  mathematics 
and  reasoning.  The  patterns  of  variance  components  contributing  to 
error  were  consistent:  raters'  ratings  accounted  for  most  of  the  error 
variation  and  occasions  and  centers  accounted  for  little.  The  patterns 
of variance components suggest that, by taking the average rating of
four raters, measurement error can be reduced by about 75%
($\sigma^2_\Delta$ in Table 4) and the generalizability coefficients
($\rho^2$) correspondingly increased to .86 for reasoning, .79 for
mathematics, and .85 for language.

Table 4

Univariate Generalizability Study of G.E.D. Ratings

The design is raters nested within centers crossed with jobs and occasions.

The consistent patterns of results for the reasoning, mathematics,
and language ratings were not unexpected since their correlations are:
$r_{\text{reasoning, math}} = .74$; $r_{\text{reasoning, language}} = .84$;
$r_{\text{math, language}} = .73$. The size of the correlations suggests
that all three GED ratings share a common, underlying factor and that a
multivariate generalizability coefficient would be appropriate.

3.3.  Multivariate  Generalizability  Analysis 

For the multivariate generalizability study, a random effects
multivariate analysis of variance was performed using the reasoning,
mathematics, and language ratings as a vector of scores. Due to the
limited capacity of computer programs available to perform the
multivariate analysis and because geographic center contributed little
to variability among job ratings, geographic center was excluded from
the multivariate analysis. The design for this analysis was, then,
raters crossed with jobs and occasions.

For each source of variation in the design, variance component
matrices were computed from the mean square matrices. Hence, one
matrix, for example, comprised estimated universe-score variances and
covariances. All matrices with negative estimated variance components
(diagonal values) were set equal to zero in further estimation. For
this analysis, the matrices of variance components, coefficients of
generalizability, and canonical weights corresponding to each
coefficient of generalizability were computed.
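
The step from mean square matrices to component matrices can be
sketched as follows. This is an illustration under stated assumptions,
not the program used in the study: it assumes a fully crossed, random
jobs x raters x occasions design with one observation per cell, so that
the usual expected-mean-square equations apply element by element to
the matrices. The function names, the sample sizes, and the component
values are hypothetical, and zeroing the whole matrix when a diagonal
entry is negative is one reading of the rule described above.

```python
import numpy as np


def expected_mean_squares(comp, n_j, n_r, n_o):
    """Expected mean-square matrices for a crossed, random j x r x o design."""
    res = comp["jro,e"]
    return {
        "jro,e": res,
        "jr": res + n_o * comp["jr"],
        "jo": res + n_r * comp["jo"],
        "ro": res + n_j * comp["ro"],
        "j": res + n_o * comp["jr"] + n_r * comp["jo"] + n_r * n_o * comp["j"],
        "r": res + n_o * comp["jr"] + n_j * comp["ro"] + n_j * n_o * comp["r"],
        "o": res + n_r * comp["jo"] + n_j * comp["ro"] + n_j * n_r * comp["o"],
    }


def component_matrices(ms, n_j, n_r, n_o):
    """Solve the expected-mean-square equations for the component matrices."""
    res = ms["jro,e"]
    raw = {
        "jro,e": res,
        "jr": (ms["jr"] - res) / n_o,
        "jo": (ms["jo"] - res) / n_r,
        "ro": (ms["ro"] - res) / n_j,
        "j": (ms["j"] - ms["jr"] - ms["jo"] + res) / (n_r * n_o),
        "r": (ms["r"] - ms["jr"] - ms["ro"] + res) / (n_j * n_o),
        "o": (ms["o"] - ms["jo"] - ms["ro"] + res) / (n_j * n_r),
    }
    # Assumed rule: a matrix with any negative diagonal entry is zeroed.
    return {
        name: np.zeros_like(mat) if np.any(np.diag(mat) < 0) else mat
        for name, mat in raw.items()
    }


def sym(v1, cov, v2):
    """Build a symmetric 2 x 2 variance-covariance matrix."""
    return np.array([[v1, cov], [cov, v2]])


# Hypothetical components for two ratings and hypothetical sample sizes.
n_j, n_r, n_o = 20, 4, 2
true = {
    "j": sym(0.80, 0.60, 0.70), "r": sym(0.04, 0.02, 0.08),
    "o": sym(0.01, 0.00, 0.01), "jr": sym(0.12, 0.10, 0.13),
    "jo": sym(0.02, 0.01, 0.02), "ro": sym(0.01, 0.00, 0.01),
    "jro,e": sym(0.20, 0.07, 0.30),
}
ms = expected_mean_squares(true, n_j, n_r, n_o)
estimates = component_matrices(ms, n_j, n_r, n_o)

# A mean-square matrix below the residual yields negative diagonals.
ms_bad = dict(ms)
ms_bad["jo"] = ms["jro,e"] - 0.05 * np.eye(2)
est_bad = component_matrices(ms_bad, n_j, n_r, n_o)
print(est_bad["jo"])
```

Because the mean squares here are generated from known components, the
estimates recover those components exactly; with real data the mean
square matrices would come from the multivariate analysis of variance.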

The estimated variance and covariance component matrices representing
the seven sources of variation are presented in Table 5. Only the
components for one rater and one occasion are included. To obtain the
results for four raters, the components corresponding to the rater main
effect and interactions need only be divided by four.

As a consequence of the calculation procedure, the variance
components in Table 5 are the same as those produced by the univariate
analysis. The components of covariance, however, provide new information.

Table 5

Estimated Variance and Covariance Components for Multivariate
Generalizability Study of G.E.D. Ratings ($n_r = 1$, $n_o = 1$)a

Source of
Variation         Reasoning    Mathematics    Language

Jobs (J)             .75
                     .64           .66
                     .88           .74          1.09

Occasions (O)        .00
                     .00           .00
                     .00           .00           .00

Raters (R)           .03
                     .03           .09
                     .03           .05           .05

JO                   .00
                     .00           .00
                     .00           .00           .00

JR                   .12
                     .11           .13
                     .09           .07           .11

OR                   .00
                     .01           .01
                     .00           .00           .01

JRO,e                .21
                     .07           .29
                     .11       [illegible]    [missing]

a The design is raters crossed with jobs and occasions.


The large components for jobs reflect the underlying correlations
among the GED components. Jobs that require high reasoning ability
are seen by the raters to require high mathematics and language ability.
Whereas the nonzero components of variance for raters indicate that some
raters give higher ratings than others, the positive components of
covariance indicate that the raters who give higher ratings on one GED
component are likely to give higher ratings on the other GED components.
The positive components for the job x rater interaction suggest that
not only do raters disagree about which jobs require more ability, but
their disagreement is consistent across GED components. The nonzero
components for error suggest that the unexplained factors that
contribute to the variation of ratings also contribute to the
covariation between ratings. As expected, the components of covariance
due to the occasion main effect and interactions are negligible.

Composites of general educational development that have maximum
generalizability are presented in Table 6. When the generalizability
of GED ratings was estimated for one rater and one occasion, one
dimension with a generalizability coefficient exceeding .50 emerged
from the analysis. This dimension is a verbal composite of reasoning
and language. The analysis using four raters and one occasion produced
two dimensions with generalizability coefficients exceeding .50. The
first composite is defined by reasoning and language. This composite
has a generalizability coefficient of .74 for one rater and .92 for
four raters. As in the univariate case, the estimate of measurement
error is reduced by 75% when four raters are used. The second
composite is a contrast between mathematics and language or, using more
common terminology, a verbal-quantitative contrast. The estimate of
generalizability for this contrast is .62 for a D study with four
raters and one occasion.
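
Composites of this kind can be obtained from a generalized eigenvalue
problem. The Python sketch below is one common way to set that problem
up, offered as an illustration rather than as the procedure used for
Table 6: it looks for weights $a$ that maximize
$a'\Sigma_p a / a'(\Sigma_p + \Sigma_\delta) a$, where $\Sigma_p$ holds the
universe-score components and $\Sigma_\delta$ the error components. The
function name and both matrices are hypothetical, and the whole error
matrix is assumed to be rater-related so that it can be divided by the
number of raters.

```python
import numpy as np


def max_generalizability_composites(sigma_p, sigma_delta):
    """Return (coefficients, weights) sorted from most to least generalizable.

    Solves sigma_p a = rho2 (sigma_p + sigma_delta) a by whitening with the
    Cholesky factor of the observed-score matrix.
    """
    sigma_obs = sigma_p + sigma_delta
    chol = np.linalg.cholesky(sigma_obs)        # sigma_obs = chol @ chol.T
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ sigma_p @ chol_inv.T  # symmetric matrix
    rho2, vecs = np.linalg.eigh(whitened)       # eigenvalues ascending
    order = np.argsort(rho2)[::-1]
    # Weights are scaled so that each composite has unit observed variance.
    return rho2[order], chol_inv.T @ vecs[:, order]


# Hypothetical matrices for three ratings (one rater, one occasion).
sigma_p = np.array([[0.8, 0.6, 0.7],
                    [0.6, 0.7, 0.6],
                    [0.7, 0.6, 1.0]])
sigma_delta = np.array([[0.3, 0.1, 0.1],
                        [0.1, 0.4, 0.1],
                        [0.1, 0.1, 0.3]])

results = {}
for n_raters in (1, 4):
    # Assumption: every error component involves raters.
    results[n_raters] = max_generalizability_composites(
        sigma_p, sigma_delta / n_raters)
    print(n_raters, np.round(results[n_raters][0], 2))
```

Each column of the returned weight matrix defines one composite, and
the matching coefficient is the generalizability of that composite.
Averaging over more raters raises every coefficient, which is why a
second dimension can cross the .50 threshold only in the four-rater
analysis.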

3.4.  Unbalanced  Designs 

The original design of the study was unbalanced. This section
illustrates the estimation of variance components for an unbalanced
design. The unbalanced design analyzed here is raters nested within
a random sample of five of the eleven geographic centers crossed with
jobs and occasions. (The restriction to five centers will be explained
below.) The results of this analysis will be compared to those of two
balanced designs: (1) two raters randomly sampled from each of the
five centers, and (2) two raters randomly selected from each of the 11
centers.

Table 6

CANONICAL VARIATES FOR MULTIVARIATE GENERALIZABILITY
STUDY OF G.E.D. RATINGSa

                                  Canonical Coefficients
                         n_r = 1, n_o = 1     n_r = 4, n_o = 1
G.E.D. Component                 I               I        II

Reasoning                       .34             .38       .05
Mathematics                     .06             .06     -1.95
Language                        .51             .57      1.33

Coefficient of
Generalizability ($\rho^2$)     .74             .92       .62

a The design is raters crossed with jobs and occasions.

The estimates of variance components in the unbalanced design were
obtained using a modification of Rao's MIVQUE procedure suggested by
Hartley, Rao, and LaMotte (1978; see Section 2.1.5). Because the SAS
procedure VARCOMP would require an excessive amount of region if all
11 geographic centers were to be included in the analysis, only a
subset of the centers could be used. With the region limited to 300K
bytes of core, the largest design that the computer program would run
had five centers with a total of 28 raters. For these five centers,
the number of raters per center ranged from two to seven. The results
of the three analyses (unbalanced design with five centers, balanced
design with five centers, and balanced design with 11 centers) are
presented in Table 7. The estimates of the variance components are
much the same in the three analyses. The primary source of
variability, as was seen in the applications discussed previously, was
raters' ratings. The estimates of error variance ($\hat{\sigma}^2$)
and generalizability coefficients ($\rho^2$) are also similar across
the three designs.

Table 7

Estimates of Variance Components from Unbalanced and Balanced Data

The design is raters nested within centers crossed with jobs and occasions.

A few minor differences are apparent across the three analyses.
First, the variance component for jobs was smaller in the balanced
design with five centers than in the other two designs. Apparently, the
raters in this design used a smaller range in their ratings of jobs
than did the raters analyzed in the other designs. This discrepancy
did not, however, change the overall results concerning the
dependability of raters' judgments. The second difference across
analyses was the negative variance components. The analyses of five
centers produced more negative estimates of variance components than
the analysis of the 11 centers. The patterns of negative estimates,
however, were similar across the analyses. Components involving
geographic center were the most likely candidates for negative
estimation.

To  determine  whether  the  five  centers  analyzed  in  the  unbalanced 
design  might  have  produced  atypical  results  compared  to  analyses  using 
other  subsets  of  centers,  five  additional  analyses  were  carried  out 
using  other  combinations  of  centers.  The  analyses  were  subject  to  the 
limit  of  300K  bytes  of  core.  All  of  the  additional  analyses  produced 
nearly  the  same  results  as  those  reported  in  Table  7.  The  patterns  of 
variance  components  and  the  resulting  estimates  of  error  variance  and 
generalizability  coefficients  were  very  similar.  Across  five  additional 
analyses,  the  coefficients  of  generalizability  ranged  from  .61  to  .66. 

With the present methodology, two strategies seem to be available
for analyzing reasonably large unbalanced designs: (1) to sample
conditions of the nested facet to produce a balanced (crossed) design,
or (2) to reduce the unbalanced design. In the analyses discussed here,
the two approaches yielded similar estimates of the components of
variance. Although the two options may produce similar results, the
first option, sampling to produce a crossed design, affords greater
flexibility in choosing computational procedures.

3.5.  Bayesian  Estimation 

The Bayesian estimates of modal variance components, presented in
Section 2.1.2, assume a non-informative prior which includes the
constraint that estimated variance components cannot be negative.
Using [16], modal estimates can readily be calculated from the sums of
squares provided by an ANOVA. In general, the adjusted mean square is
given by:

$$\tilde{V} = \frac{SS}{df + 2}$$


The Bayesian modal estimates of the variance components
($\tilde{\sigma}^2$) can be obtained by equating the $\tilde{V}$s, the
adjusted mean squares, to their expectations. For example, consider the
Job x Occasion x Center x Rater:Center generalizability study and the
GED rating for reasoning (see Table 4). The sum of squares for the
residual (JOR:C) was 63.01 and $df_{res}$ was 286. The adjusted mean
square is $\tilde{V}_{res} = 63.01/(286 + 2) = .219$ and so
$\tilde{\sigma}^2_{res} = .219$, while $\hat{\sigma}^2_{res} = .220$
(Table 8). The sum of squares for the next source of variation, JOC,
was 48.501 and $df_{JOC}$ was 260. The adjusted mean square is
$\tilde{V}_{JOC} = 48.501/(260 + 2) = .185$. Setting $\tilde{V}_{JOC}$
equal to its expectation, we obtain $\tilde{\sigma}^2_{JOC} = -.0175$.

Since Bayesian estimates are constrained to be greater than or
equal to zero, the negative value of $\tilde{\sigma}^2_{JOC}$ indicates
that $\tilde{V}_{JOC}$ provides a second estimate of error independent
of the estimate provided by $\tilde{\sigma}^2_{res}$. That is, the
expected value of $\tilde{V}_{JOC}$ is
$\sigma^2_{res} + n_r \sigma^2_{JOC}$. Since $\tilde{\sigma}^2_{JOC}$
is constrained to be zero and not negative, $\tilde{V}_{JOC}$ estimates
$\sigma^2_{res}$. The Bayesian approach then pools the two estimates of
measurement error as follows:

$$\tilde{V}_{res(pooled)} = \frac{SS_{res} + SS_{JOC}}{df_{res} + df_{JOC} + 2}$$



Setting $\tilde{V}_{res(pooled)}$ equal to its expectation,
$\tilde{\sigma}^2_{res(pooled)} = .203$. This pooled estimate is
carried through subsequent calculations of variance components and the
generalizability coefficient. It is also used in interpreting the
results of the G study.
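
The arithmetic of this example can be retraced in a few lines of
Python. The sketch is illustrative only. Two points are assumptions
rather than statements from the text: the number of raters per center
is taken as $n_r = 2$, which is consistent with the reported degrees of
freedom, and the pooled adjusted mean square is taken to have
$df_{res} + df_{JOC} + 2$ in its denominator, which is the form that
reproduces the reported value of .203. The function name is
hypothetical.

```python
def adjusted_mean_square(ss, df):
    """Adjusted mean square: SS / (df + 2)."""
    return ss / (df + 2)


# Sums of squares and degrees of freedom quoted for the reasoning rating.
ss_res, df_res = 63.01, 286
ss_joc, df_joc = 48.501, 260
n_r = 2  # assumption: two raters per center

v_res = adjusted_mean_square(ss_res, df_res)
v_joc = adjusted_mean_square(ss_joc, df_joc)

# E(V_JOC) = sigma2_res + n_r * sigma2_JOC, solved for sigma2_JOC.
sigma2_joc = (v_joc - v_res) / n_r

if sigma2_joc < 0:
    # Negative estimate: treat V_JOC as a second estimate of error and
    # pool it with the residual (assumed form of the pooled denominator).
    v_pooled = (ss_res + ss_joc) / (df_res + df_joc + 2)
    sigma2_joc = 0.0
else:
    v_pooled = v_res

print(f"V_res = {v_res:.3f}, V_JOC = {v_joc:.3f}")
print(f"pooled residual estimate = {v_pooled:.3f}")
```

The unconstrained JOC estimate computed this way is negative and close
to, though not identical with, the printed value of -.0175; the small
difference presumably reflects rounding or the residual estimate used
in the original calculation.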

In Table 8, the Bayesian estimates of the modal variance components
are compared with the traditional estimates. As expected, the Bayesian
estimates are slightly smaller than the traditional estimates. The
Bayesian generalizability coefficient, however, is equal to the
traditional estimate.

Table 8

Comparison of Bayesian and Traditional
Estimates of Variance Components: Ratings
of Reasoninga

                            Bayesian Estimates    Traditional Estimates
                               of Variance             of Variance
Source                         Components              Components

Jobs (J)                          .69                     .74
Occasions (O)                     .00                     .00
Centers (C)                       .00                     .00
Raters(Centers) (R:C)             .05                     .06
JO                                .00                     .00
JC                                .00                     .00
JR:C                              .13                     .13
OC                                .01                     .01
OR:C                              .00                     .00
JOC                               .00                     .00
JOR:C (res)                       .20                     .22

Error Variance                    .39                     .42
Generalizability ($\rho^2$)       .64                     .64

a The design is raters nested within centers crossed with jobs and occasions.

While the two procedures for calculating the estimates differ
little, the assumptions underlying the two estimation approaches differ
considerably. Perhaps the major difference, in addition to the
non-informative prior for the Bayesian estimates, is the pooling
procedure associated with the Bayesian estimates. This procedure makes
use of the information available when a negative estimate arises,
something the traditional theory, in practice, ignores (see Box & Tiao,
1973, on problems of pooling with the traditional approach).

Introductions to G theory are provided by Brennan (1977a, 1979a),
Brennan and Kane (1980), Erlich and Shavelson (1976b), Gillmore (1979),
Cardinet and Tourneur (1978), Huysamen (1980), Tourneur (1978), Tourneur
and Cardinet (1977), Van der Kamp (1976), and Wiggins (1973).

For brief discussions of G theory and its application to
criterion-referenced measurement not treated here, see Cardinet and
Tourneur (1974), Cardinet, Tourneur and Allal (in press), Cronbach
(1976), Davis (1974), Kane and Brennan (1977), and Tourneur (1977).

Computer programs designed specifically for univariate
generalizability analyses have been reported by Brennan (1979a),
Cornilius, Woodward, and Demaree (1976), Erlich and Shavelson (1976a),
and Erlich and Borich (1978).